The Linux RAID-1, 4, 5 Code
We wanted to preserve the behavior of the ll_rw_blk routine so that it stays as close as possible to what its client code expects. Since the code now provides error recovery, an innocent, simple-looking request can turn out to be a complex set of requests; therefore, the MD code now starts a configurable number of kernel threads that arbitrate these complex requests. It exports those threads to the new personalities through the md_register_thread and md_unregister_thread functions. Each MD thread sleeps on its own wait queue and wakes up when needed. It then calls the personality's thread function and runs the disk task queue.
The mirroring personality is the simplest one in the new code. Whenever a read request is made, it is sent to one of the operational disks in the disk array; in the case of a write request, the personality's request-making code puts a write request for each device in the array into the system-request queue.
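The dispatch rule described above can be sketched in a few lines. This is only an illustration, not the driver's code: the function name `mirror_requests`, the list of booleans describing which disks are operational, and the choice of the first live disk for reads are all assumptions made for the example.

```python
def mirror_requests(op, sector, operational):
    """Return the (disk, op, sector) requests a mirror set would issue.

    `operational` is a hypothetical list of booleans, one per disk.
    """
    live = [disk for disk, ok in enumerate(operational) if ok]
    if not live:
        raise IOError("no operational disks left in the mirror")
    if op == "read":
        # A read needs only one copy; this sketch picks the first live disk.
        return [(live[0], "read", sector)]
    if op == "write":
        # A write must reach every operational copy.
        return [(disk, "write", sector) for disk in live]
    raise ValueError(f"unknown operation: {op}")


# One read request versus one write request per live disk.
print(mirror_requests("read", 7, [False, True, True]))
print(mirror_requests("write", 7, [True, False, True]))
```

A real implementation could pick the read target differently (for example, to balance load); the sketch only shows the one-versus-all asymmetry between reads and writes.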
If one of the devices reports a disk error while writing, the device is marked as non-operational, and a message is logged to the syslog facility to notify the operator. If the error happens during a disk read, the read is retried from one of the operational devices: the code puts the read request into a queue and wakes up the raid1d kernel thread.
The raid1d kernel thread has just one purpose in life: to retry read requests. When it wakes up, it resubmits any queued requests to the operational disks.
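The queue-and-retry pattern can be modeled roughly as follows. This is a user-space sketch under stated assumptions, not the raid1d code: each disk is a hypothetical dict mapping sector numbers to data, a missing key stands in for a disk I/O error, and `read_sector` and `retry_reads` are names invented for the example.

```python
from collections import deque


def read_sector(disks, retry_queue, disk, sector):
    """Try a read; on failure, queue the request for the retry thread."""
    try:
        return disks[disk][sector]
    except KeyError:  # stands in for a disk I/O error
        retry_queue.append((sector, disk))
        return None


def retry_reads(disks, operational, retry_queue):
    """What a raid1d-like thread might do when woken: drain the queue."""
    results = {}
    while retry_queue:
        sector, failed = retry_queue.popleft()
        for disk, ok in enumerate(operational):
            # Retry on any operational disk other than the one that failed.
            if ok and disk != failed and sector in disks[disk]:
                results[sector] = disks[disk][sector]
                break
    return results


disks = [{}, {5: b"ok"}]  # disk 0 cannot deliver sector 5
queue = deque()
read_sector(disks, queue, 0, 5)  # fails and queues the request
print(retry_reads(disks, [True, True], queue))
```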
Both RAID-4 and RAID-5 provide block-interleaved parity; the former stores the parity on a single disk of the array, while the latter distributes the parity among all the disks in the array. Most of the code is shared between the two modes; a single routine accounts for the difference.
The easiest case is the one where all of the disks in the array are working properly. Here there are two code paths, one for reads and one for writes. When asked to read a block from the disk array, the code puts the computed location of the data sector in the system request queue, and no further complications arise.
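To make the difference concrete, here is a minimal sketch of a parity-placement routine. The function name `parity_disk`, the use of the last disk as the RAID-4 parity disk, and the particular rotation order for RAID-5 are assumptions chosen for illustration; the driver's actual layout may differ.

```python
def parity_disk(stripe, n_disks, level):
    """Return the index of the disk holding parity for a given stripe."""
    if level == 4:
        # RAID-4: one dedicated parity disk (assumed here to be the last).
        return n_disks - 1
    if level == 5:
        # RAID-5: parity rotates across all disks from stripe to stripe.
        return (n_disks - 1) - (stripe % n_disks)
    raise ValueError(f"unsupported RAID level: {level}")


# Parity location for the first five stripes of a four-disk array.
print([parity_disk(s, 4, 4) for s in range(5)])
print([parity_disk(s, 4, 5) for s in range(5)])
```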
In the case of a write, things are more complex since the driver has to write the corresponding block as well as update the parity block. When a single sector from the disk array is written, the code needs to do the following:
1. Read the old contents of the data sector and the old contents of the parity sector.
2. Compute a new parity sector.
3. Write the new data sector together with the new parity sector.
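The three steps above rely on the fact that parity is the XOR of the data sectors, so the new parity can be computed as old parity XOR old data XOR new data without reading the other disks. The following sketch models one stripe as a list of byte strings; `xor_blocks` and `write_sector` are hypothetical names, and the synchronous in-memory "I/O" is a simplification of what the driver does with queued requests.

```python
def xor_blocks(a, b):
    """Byte-wise XOR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same size")
    return bytes(x ^ y for x, y in zip(a, b))


def write_sector(stripe, data_disk, parity_disk, new_data):
    """Read-modify-write update of one data sector and its parity."""
    # Step 1: read the old data and the old parity.
    old_data = stripe[data_disk]
    old_parity = stripe[parity_disk]
    # Step 2: new parity = old parity XOR old data XOR new data.
    new_parity = xor_blocks(old_parity, xor_blocks(old_data, new_data))
    # Step 3: write the new data together with the new parity.
    stripe[data_disk] = new_data
    stripe[parity_disk] = new_parity


# Two data sectors plus their parity (disk 2 holds the parity).
stripe = [b"\x01\x02", b"\x10\x20", b"\x11\x22"]
write_sector(stripe, 0, 2, b"\xff\x00")
# The parity still equals the XOR of the two data sectors.
print(stripe[2] == xor_blocks(stripe[0], stripe[1]))
```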
Since the upper layers expect the code to put the request on the queue, the code starts up the read requests on the disk and returns to the caller.
When any of the requests completes, raid5_end_request is called. This routine and the RAID-5 thread, raid5d, are together responsible for keeping track of the current state of the block. If there are no problems with the disk I/O operations, the request is finally marked as finished and the upper layers can use the block.
If there is an error while reading or writing a block, the RAID driver marks the faulting disk as non-operational and continues operation without it. The driver reconstructs lost blocks and computes the parity from the information available on the other disks. Both the block-read and block-write routines become more complex when something has gone wrong. The disk array is slower, but regular operation of the system can continue until the faulty disk is replaced.
If spare disks are configured in the disk array, the RAID driver starts a thread that re-creates the information that was held on the failed disk. When the reconstruction finishes, the disk is marked as operational, and the driver resumes operation at the regular speed.
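Reconstruction works because the XOR of all sectors in a stripe, parity included, is zero; any one missing sector is therefore the XOR of the survivors. A minimal sketch, assuming a stripe modeled as a list of equally sized byte strings and a hypothetical helper named `reconstruct_block`:

```python
from functools import reduce


def reconstruct_block(stripe, failed):
    """Rebuild the sector of the failed disk from the surviving sectors."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed]
    # XOR all surviving sectors (data and parity alike) byte by byte.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), survivors)


# Two data sectors and their parity; disk 1 has failed (its sector is unknown).
stripe = [b"\x01\x02", None, b"\x11\x22"]
print(reconstruct_block(stripe, 1))
```

Every degraded read of a sector on the failed disk has to touch all surviving disks in this way, which is consistent with the slowdown described above.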
We have updated the “userland” utilities that control the MD driver to make use of the features found on the new personalities.
The tool mkraid is used to configure a RAID device. It reads a configuration file and creates a RAID superblock, where all the administrative information about the device is stored.
At system boot time, the syncraid program checks the RAID superblock to make sure that the RAID system was unmounted cleanly and reconstructs the redundant information in case something went wrong.
There is a notable bottleneck in RAID-4. All of the parity information is kept on a single disk, so any write to any of the data disks in the array also incurs a write to the parity disk. For this reason, the write speed of a RAID-4 array is limited by the speed of the parity disk. It may seem unreasonable to have such a personality, since RAID-5 addresses this problem. We have implemented RAID-4 because, in some configurations of disk arrays, access to the disks may be serialized by the disk controller (this is the case with some IDE drives).
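The bottleneck can be illustrated by counting where the parity writes land when one sector is written in each of several stripes. This is a toy model, not a benchmark: the function `parity_write_load`, the choice of the last disk as the RAID-4 parity disk, and the RAID-5 rotation order are assumptions made for the example.

```python
from collections import Counter


def parity_write_load(n_disks, n_stripes, level):
    """Count parity writes per disk for one write in each stripe."""
    load = Counter()
    for stripe in range(n_stripes):
        if level == 4:
            parity = n_disks - 1  # dedicated parity disk (assumed last)
        else:
            parity = (n_disks - 1) - (stripe % n_disks)  # rotating parity
        load[parity] += 1
    return load


# Four disks, eight stripes: RAID-4 piles every parity write on one disk,
# while RAID-5 spreads them evenly.
print(parity_write_load(4, 8, 4))
print(parity_write_load(4, 8, 5))
```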