Kernel Korner - I/O Schedulers
Although most Linux users are familiar with the role of process schedulers, such as the new O(1) scheduler, many users are not so familiar with the role of I/O schedulers. I/O schedulers are similar in some aspects to process schedulers; for instance, both schedule some resource among multiple users. A process scheduler virtualizes the resource of processor time among multiple executing processes on the system. So, what does an I/O scheduler schedule?
A naïve system would not even include an I/O scheduler. Unlike the process scheduler, the I/O scheduler is not a mandatory component of the operating system. Instead, performance is the I/O scheduler's sole raison d'être.
To understand the role of an I/O scheduler, let's go over some background information and then look at how a system behaves without an I/O scheduler. Hard disks address their data using the familiar geometry-based addressing of cylinders, heads and sectors. A hard drive is composed of multiple platters, each consisting of a single disk, spindle and read/write head. Each platter is divided further into circular ring-like tracks, similar to a CD or record. Finally, each track is composed of some integer number of sectors.
To locate a specific unit of data in a hard drive, the drive's logic requires three pieces of information: the cylinder, the head and the sector. The cylinder specifies the track on which the data resides. If you lay the platters on top of one another (as they are in a hard disk), a given track forms a cylinder through each platter. The head then identifies the exact read/write head (and thus the exact platter) in question. The search now is narrowed down to a single track on a single platter. Finally, the sector value denotes the exact sector on the track. The search is complete: the hard disk knows what sector, on what track, on what platter the data resides. It can position the read/write head of the correct platter over the correct track and read the proper sector.
Thankfully, modern hard disks do not force computers to communicate with them in terms of cylinders, heads and sectors. Instead, modern hard drives map a unique block number over each cylinder/head/sector triplet. The unique number identifies a specific cylinder/head/sector value. Modern operating systems then can address hard drives using this block number—called logical block addressing—and the hard drive translates the block number into the correct cylinder/head/sector value.
One thing of note about this block number: although nothing guarantees it, the physical mapping tends to be sequential. That is, logical block n tends to be physically adjacent to logical block n+1. We discuss why that is important later on.
Now, let's consider the typical UNIX system. Applications as varied as databases, e-mail clients, Web servers and text editors issue I/O requests to the disk, such as read this block and write to that block. The blocks tend to be located physically all over the disk. The e-mail spool may be located in an entirely different region of the disk from the Web server's HTML data or the text editor's configuration file. Indeed, even a single file can be strewn all over the disk if the file is fragmented, that is, not laid out in sequential blocks. Because the files are broken down into individual blocks, and hard drives are addressed by block and not the much more abstract concepts of files, reading or writing file data is broken down into a stream of many individual I/O requests, each to a different block. With luck, the blocks are sequential or at least physically close together. If the blocks are not near one another, the disk head must move to another location on the disk. Moving the disk head is called seeking, and it is one of the most expensive operations in a computer. The seek time on modern hard drives is measured in the tens of milliseconds. This is one reason why defragmented files are a good thing.
Unfortunately, it does not matter if the files are defragmented because the system is generating I/O requests for multiple files, all over the disk. The e-mail client wants a little from here and the Web server wants a little from there—but wait, now the text editor wants to read a file. The net effect is that the disk head is made to jump around the disk. In a worst-case scenario, with interleaved I/O requests to multiple files, the head can spend all of its time jumping around from one location to another—not a good thing for overall system performance.
This is where the I/O scheduler comes in. The I/O scheduler schedules the pending I/O requests in order to minimize the time spent moving the disk head. This, in turn, minimizes disk seek time and maximizes hard disk throughput.
This magic is accomplished through two main actions, sorting and merging. First, the I/O scheduler keeps the list of pending I/O requests sorted by block number. When a new I/O request is issued, it is inserted, block-wise, into the list of pending requests. This prevents the drive head from seeking all around the disk to service I/O requests. Instead, by keeping the list sorted, the disk head moves in a straight line around the disk. If the hard drive is busy servicing a request at one part of the disk, and a new request comes in to the same part of the disk, that request can be serviced before moving off to other parts of the disk.
Merging occurs when an I/O request is issued to an identical or adjacent region of the disk. Instead of issuing the new request on its own, it is merged into the identical or adjacent request. This minimizes the number of outstanding requests.
Let's look at an example. Consider the case where two applications issue requests to the following block numbers, such that they arrived in the kernel in this order: 10, 500, 12, 502, 14, 504 and 12. The I/O scheduler-less approach would service these blocks in the given order. That is seven long seeks, back and forth between two parts of the disk. What a waste! If the kernel sorted and merged these requests, however, and serviced them in that order, the results would be much different: 10, 12, 14, 500, 502 and 504. Only a single far-off seek and one less request overall.
In this manner, an I/O scheduler virtualizes the resources of disk I/O among multiple I/O requests in order to maximize global throughput. Because I/O throughput is so crucial to system performance and because seek time is so horribly slow, the job of an I/O scheduler is important.
Today’s modular x86 servers are compute-centric, designed as a least common denominator to support a wide range of IT workloads. Those generic, virtualized IT workloads have much different resource optimization requirements than hyperscale and cloud applications. They have resulted in a “one size fits all” enterprise IT architecture that is not optimized for a specific set of IT workloads, and especially not emerging hyperscale workloads, such as web applications, big data, and object storage. In this report, you will learn how shifting the focus from traditional compute-centric IT architectures to an innovative disaggregated fabric-based architecture can optimize and scale your data center.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
| Non-Linux FOSS: Seashore | May 10, 2013 |
| Trying to Tame the Tablet | May 08, 2013 |
- Using Salt Stack and Vagrant for Drupal Development
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- New Products
- Validate an E-Mail Address with PHP, the Right Way
- Drupal Is a Framework: Why Everyone Needs to Understand This
- A Topic for Discussion - Open Source Feature-Richness?
- Home, My Backup Data Center
- New Products
- RSS Feeds
- New Products
- I like your topic on android
15 min 49 sec ago - Reply to comment | Linux Journal
36 min 59 sec ago - This is the easiest tutorial
6 hours 51 min ago - Ahh, the Koolaid.
12 hours 30 min ago - git-annex assistant
18 hours 29 min ago - direct cable connection
18 hours 52 min ago - Agreed on AirDroid. With my
19 hours 2 min ago - I just learned this
19 hours 6 min ago - enterprise
19 hours 36 min ago - not living upto the mobile revolution
22 hours 27 min ago
Enter to Win an Adafruit Prototyping Pi Plate Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Prototyping Pi Plate Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- Next winner announced on 5-21-13!
Free Webinar: Linux Backup and Recovery
Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.
In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.




Comments
Very informative read.
Thanks for taking the time to publish this article.
I'm currently studying operating system fundamentals at university and it was great to read on it's usage in the Linux operating system. I had learned how Linux treats all hardware items as 'files', but reading this has given me a much more in-depth understanding.
*Bookmarked*
great article
Hi this is grate, is there any other documents like this plain and symple
Thanks
Best regards
Nice article
I like this. It describes the problem to the point and nothing else. Thanks for writing it.
Re: Kernel Korner: I/O Schedulers
Elementary my dear Watson.
That's only a mere 122 fold increase in test two.
Now for your next trick.......
Mick.
What about driver improvements?
Your discounting the fact that there are many other performance enhancing improvements to the 2.6 kernel that have highly boosted read/write performance. Especially in the form of better supported IDE/SCSI devices and better drivers and feature sets.
I agree that anticipatory scheduler is MUCH better than the older schedulers, but recognize hardware/software improvements as well.
Anticipatory is not the best for everyone also, I recommend any person or company to run many many tests that will emulate your environment to test which scheduler is best for you...
-Tom
Nice article. However, I
Nice article. However, I agree that schedulers needs to be tested in a particular environment to be sure which choice is best.
For example, AS is a poor choice for RAID and/or virtual machines. Virtual hosts with RAID disks appear to work best with the deadline elevator. VM guests should use the noop elevator so not to disguise physical access with virtual characteristics, let alone the ineffective and unnecessary overhead of using AS in a VM guest.
-shubes