The Linux RAID-1, 4, 5 Code

A description of the implementation of the RAID-1, RAID-4 and RAID-5 personalities of the MD device driver in the Linux kernel, providing users with high-performance, reliable secondary storage implemented in software.
The Generic MD Changes

We wanted to preserve the behavior of the ll_rw_blk routine so that it stays as close as possible to what its client code expects. Since the code now provides error recovery, an innocent-looking request can turn out to be a complex set of requests; the MD code therefore starts a configurable number of kernel threads to arbitrate these complex requests, and exports them to the new personalities through the md_register_thread and md_unregister_thread functions. Each MD thread sleeps on its own wait queue and wakes when needed; it then calls the personality's thread routine and runs the disk task queue.

The RAID-1 Code

The mirroring personality is the simplest one in the new code. Whenever a read request is made, it is sent to one of the operational disks in the disk array; in the case of a write request, the personality's request-making code puts a write request for each device in the array into the system-request queue.

In the event of a disk error on one of the devices during a write, the device is marked as non-operational, and a message is logged to the syslog facility to notify the operator of the situation. If the error happens during a read, the code puts the read request into a queue and wakes up the raid1d kernel thread, which retries the read from one of the operational devices.

The raid1d kernel thread has just one purpose in its life: retrying read requests. When it wakes up, it resubmits any queued requests to the operational disks.

The RAID-4 and RAID-5 Code

Both RAID-4 and RAID-5 provide block-interleaved parity; the former stores the parity in a single disk from the array, while the latter distributes the parity among all the disks in the array. Most of the code is the same in both modes. Just one routine makes the difference between the two modes.

The easiest code path is the one where all of the disks in the array are working properly. In this case, there are separate code paths for reads and writes. When asked to read a block from the disk array, the code puts the computed location of the data sector into the system request queue, and no further complications arise.

In the case of a write, things are more complex since the driver has to write the corresponding block as well as update the parity block. When a single sector from the disk array is written, the code needs to do the following:

  1. Read the old contents of the data sector and the old contents of the parity sector.

  2. Compute a new parity sector.

  3. Write the new data sector and the new parity sector.

Since the upper layers expect the code to put the request on the queue, the code starts up the read requests on the disk and returns to the caller.

When any of these requests completes, raid5_end_request is called. This routine, together with the RAID-5 thread raid5d, is responsible for keeping track of the current state of the block. If there are no problems with the disk I/O operations, the request is finally marked as finished, and the upper layers can use the block.

If there is an error during block reading or writing, the RAID driver marks the faulting disk as non-operational and continues operation without using it. The driver reconstructs lost blocks and computes parity from the information available on the remaining disks. Both the block-read and block-write routines become more complex once something has gone wrong. The disk array is slower, but regular operation of the system can continue until the faulty disk is replaced.

If spare disks are configured on the disk array, the RAID driver starts a thread that re-creates the information on the disk that failed. When it finishes with the reconstruction, the disk is marked as operational, and the driver resumes operation at the regular speed.

Other Changes

We have updated the “userland” utilities that control the MD driver to make use of the features found in the new personalities.

The tool mkraid is used to configure a RAID device. It reads a configuration file and creates a RAID superblock, where all the administrative information about the device is stored.

At system boot time, the syncraid program checks the RAID superblock to make sure that the RAID system was unmounted cleanly and reconstructs the redundant information in case something went wrong.

There is a notable bottleneck in RAID-4: all of the parity information is kept on a single disk, so every write to any data disk in the array also incurs a write to the parity disk. For this reason, the write speed of a level-4 array is limited by the speed of the parity disk. It may seem unreasonable to offer such a personality, since RAID-5 addresses this problem. We implemented RAID-4 because, in some disk-array configurations, access to the disks may be serialized by the disk controller (this is the case with some IDE drives).



