The Linux RAID-1, 4, 5 Code
We wanted to preserve the behavior of the ll_rw_blk routine so that it stays as close as possible to what the client code of this routine expects. Since the code now provides error recovery, an innocent and simple-looking request can turn out to be a complex set of requests; therefore, the MD code now starts a configurable number of kernel threads that arbitrate these complex requests. It exports those threads to the new personalities through the md_register_thread and md_unregister_thread functions. Each MD thread sleeps on its own wait queue and wakes up when needed. It then calls the personality's thread routine and runs the disk task queue.
The mirroring personality is the simplest one in the new code. Whenever a read request is made, it is sent to one of the operational disks in the disk array; in the case of a write request, the personality's request-making code puts a write request for each device in the array into the system-request queue.
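The dispatch rule can be illustrated with a small user-space model. This is only a sketch, not the driver's code: the Mirror class, the make_requests function and the disk names are hypothetical, and the choices of "first operational disk" for reads and "operational disks only" for writes are assumptions made to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Mirror:
    """One member disk of a mirrored array (hypothetical model)."""
    name: str
    operational: bool = True
    blocks: dict = field(default_factory=dict)


def make_requests(mirrors, op, block, data=None):
    """Return the (disk name, op, block) requests that would be queued."""
    live = [m for m in mirrors if m.operational]
    if not live:
        raise IOError("no operational disks left in the array")
    if op == "read":
        # A read needs only one copy; this sketch picks the first live disk.
        return [(live[0].name, "read", block)]
    # A write fans out: one request per operational mirror.
    for m in live:
        m.blocks[block] = data
    return [(m.name, "write", block) for m in live]


mirrors = [Mirror("sda"), Mirror("sdb", operational=False), Mirror("sdc")]
read_reqs = make_requests(mirrors, "read", 7)
write_reqs = make_requests(mirrors, "write", 7, b"payload")
print(read_reqs)
print(write_reqs)
```

A single read produces one request, while a single write produces as many requests as there are working mirrors, which is why writes cost more than reads on a mirrored array.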
In the event of a disk error in one of the devices while writing, the device is marked as non-operational, and a message is logged to the syslog facility to notify the operator about the situation. If the error happens during a disk read, the code retries the read from one of the operational devices: it puts the read request into a queue and wakes up the raid1d kernel thread.
The raid1d kernel thread has just one purpose in its life: to retry read requests. When it wakes up, it retries any queued requests on the operational disks.
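The read-retry path might look roughly like the following model. All names here (MirrorSet, retry_daemon, the broken test hook) are hypothetical, and the sketch assumes that a disk returning a read error is also taken out of service, which the text states explicitly only for write errors.

```python
from collections import deque


class MirrorSet:
    def __init__(self, disks, broken=()):
        self.disks = disks            # disk name -> {block: data}
        self.broken = set(broken)     # disks whose I/O fails (test hook)
        self.failed = set()           # disks marked non-operational
        self.retry_queue = deque()    # reads waiting for the retry thread
        self.log = []                 # stand-in for syslog

    def _io_read(self, name, block):
        if name in self.broken:
            raise IOError(f"read error on {name}")
        return self.disks[name][block]

    def operational(self):
        return [n for n in self.disks if n not in self.failed]

    def read(self, block):
        name = self.operational()[0]
        try:
            return self._io_read(name, block)
        except IOError as err:
            # Assumption: take the disk out of service, log the error,
            # and hand the read over to the retry thread.
            self.failed.add(name)
            self.log.append(str(err))
            self.retry_queue.append(block)
            return None  # the data arrives later, via retry_daemon()

    def retry_daemon(self):
        """One wake-up of the retry thread: drain the queue on live disks."""
        results = {}
        while self.retry_queue:
            block = self.retry_queue.popleft()
            for name in self.operational():
                try:
                    results[block] = self._io_read(name, block)
                    break
                except IOError as err:
                    self.failed.add(name)
                    self.log.append(str(err))
        return results


array = MirrorSet({"sda": {3: b"data"}, "sdb": {3: b"data"}}, broken={"sda"})
first = array.read(3)
recovered = array.retry_daemon()
print(first, recovered, sorted(array.failed))
```

The caller of read never sees the error itself; the failed request is parked on the queue and satisfied from another mirror on the next wake-up.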
Both RAID-4 and RAID-5 provide block-interleaved parity; the former stores the parity in a single disk from the array, while the latter distributes the parity among all the disks in the array. Most of the code is the same in both modes. Just one routine makes the difference between the two modes.
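To make the difference concrete, here is a sketch of what such a placement routine could compute. The function names are hypothetical, and the particular RAID-5 rotation shown is an assumption chosen for illustration; the text does not say which layout the driver uses.

```python
def parity_disk(level, stripe, n_disks):
    """Index of the disk holding parity for a given stripe."""
    if level == 4:
        # RAID-4: parity always lives on one dedicated disk (here, the last).
        return n_disks - 1
    if level == 5:
        # RAID-5: parity moves to a different disk on each stripe.
        # This rotation is an assumption, not necessarily the driver's.
        return (n_disks - 1 - stripe) % n_disks
    raise ValueError("only levels 4 and 5 use block-interleaved parity")


def data_disk(level, stripe, index, n_disks):
    """Disk holding the index-th data block of a stripe, skipping parity."""
    p = parity_disk(level, stripe, n_disks)
    return index if index < p else index + 1


raid4 = [parity_disk(4, s, 4) for s in range(6)]
raid5 = [parity_disk(5, s, 4) for s in range(6)]
print(raid4)
print(raid5)
```

Everything else in the data path can stay the same for both levels, because only the answer to "which disk holds the parity for this stripe?" changes.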
The easiest code path is the one where all of the disks of the array are working properly. In this case, there are two code paths, one for reads and one for writes. When asked to read a block from the disk array, the code puts a request for the computed location of the data sector in the system request queue, and no further complications arise.
In the case of a write, things are more complex since the driver has to write the corresponding block as well as update the parity block. When a single sector from the disk array is written, the code needs to do the following:
Read the old contents of the data sector and the old contents of the parity sector.
Compute a new parity sector.
Write the new data sector and the new parity sector.
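The three steps above can be sketched in a few lines, using the fact that parity in RAID-4 and RAID-5 is the byte-wise XOR of the data sectors in a stripe. The in-memory list standing in for the disks and the names xor_blocks and write_sector are hypothetical; a real driver issues these operations as asynchronous disk requests.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)


def write_sector(stripe, parity_idx, data_idx, new_data):
    """Small write: update one data sector and the stripe's parity."""
    old_data = stripe[data_idx]        # step 1: read the old data ...
    old_parity = stripe[parity_idx]    # ... and the old parity
    # Step 2: new parity = old parity XOR old data XOR new data.
    new_parity = xor_blocks(old_parity, old_data, new_data)
    stripe[data_idx] = new_data        # step 3: write data and parity
    stripe[parity_idx] = new_parity


d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\x0f\xf0"
stripe = [d0, d1, d2, xor_blocks(d0, d1, d2)]
write_sector(stripe, parity_idx=3, data_idx=1, new_data=b"\xaa\x55")
print(stripe[3] == xor_blocks(stripe[0], stripe[1], stripe[2]))
```

XOR-ing the old data out of the parity and the new data in gives the same result as recomputing parity over the whole stripe, so the other data disks never have to be read.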
Since the upper layers expect the code to put the request on the queue, the code starts up the read requests on the disk and returns to the caller.
When any of the requests has completed, the raid5_end_request routine is called. This routine, together with the RAID-5 thread raid5d, is responsible for keeping track of the current state of the block. If there are no problems with the disk I/O operations, the request is finally marked as finished and the upper layers can use the block.
If there is an error during block reading or writing, the RAID driver marks the faulting disk as non-operational and continues operation without using it. The driver reconstructs lost blocks and computes the parity from the information available on the other disks. Both the block-read and block-write routines become more complex when something has gone wrong. The disk array is slower, but regular operation of the system can continue until the faulty disk is replaced.
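Reconstruction works because the parity block is the XOR of the data blocks, so any single missing block equals the XOR of all the surviving blocks in the same stripe. The following is a minimal sketch of a degraded read under that property; degraded_read and the list-based stripe are hypothetical stand-ins, not the driver's routines.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def degraded_read(stripe, wanted, failed):
    """Read block `wanted` from a stripe whose disk `failed` is unusable."""
    if wanted != failed:
        # The block lives on a healthy disk: an ordinary read.
        return stripe[wanted]
    # The block is lost: rebuild it from every surviving block,
    # data and parity alike.
    survivors = [blk for i, blk in enumerate(stripe) if i != failed]
    return xor_blocks(survivors)


data = [b"RAID", b"-5 i", b"s ok"]
stripe = data + [xor_blocks(data)]   # last entry is the parity block
stripe[1] = None                     # disk 1 has failed
print(degraded_read(stripe, wanted=1, failed=1))
print(degraded_read(stripe, wanted=0, failed=1))
```

A degraded read of a lost block touches every remaining disk in the stripe, which is one reason the array runs slower until the faulty disk is replaced.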
If spare disks are configured on the disk array, the RAID driver starts a thread that re-creates the information on the disk that failed. When it finishes with the reconstruction, the disk is marked as operational, and the driver resumes operation at the regular speed.
We have updated the “userland” utilities that control the MD driver to make use of the features found on the new personalities.
The tool mkraid is used to configure a RAID device. It reads a configuration file and creates a RAID superblock, where all the administrative information about the device is stored.
At system boot time, the syncraid program checks the RAID superblock to make sure that the RAID system was unmounted cleanly and reconstructs the redundant information in case something went wrong.
There is a notable bottleneck in RAID-4. All of the parity information is kept on a single disk, so any write operation done on any of the data disks in the disk array will incur a write to the parity disk. For this reason, the write speed of a level-4 disk array is limited by the speed of the parity disk. It may seem unreasonable to have such a personality, since RAID-5 addresses this problem. We have implemented RAID-4 because, in some configurations of disk arrays, access to the disks may be serialized by the disk controller (this is the case with some IDE drives).
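A rough count shows the size of the imbalance. The sketch below tallies writes per disk when one data block is updated in each of twelve consecutive stripes on a four-disk array; the disk_writes function, the workload and the simple RAID-5 rotation are all assumptions made for illustration.

```python
from collections import Counter


def disk_writes(level, n_disks, stripes_written):
    """Count writes per disk when one data block per stripe is updated."""
    writes = Counter()
    for stripe in stripes_written:
        # RAID-4 pins parity to the last disk; RAID-5 rotates it
        # (this rotation is an assumed layout).
        parity = n_disks - 1 if level == 4 else stripe % n_disks
        data_slots = [d for d in range(n_disks) if d != parity]
        data = data_slots[stripe % len(data_slots)]
        writes[data] += 1      # the data block itself
        writes[parity] += 1    # plus the parity update it forces
    return [writes[d] for d in range(n_disks)]


raid4 = disk_writes(4, 4, range(12))
raid5 = disk_writes(5, 4, range(12))
print("RAID-4 writes per disk:", raid4)
print("RAID-5 writes per disk:", raid5)
```

Under this workload the RAID-4 parity disk absorbs half of all writes, while the rotated layout spreads the same total evenly across the four disks.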