Kernel Korner - I/O Schedulers
Preventing the starvation of requests in general, and read requests in particular, was a goal of the new 2.6 I/O schedulers.
The Deadline I/O Scheduler was introduced to solve the starvation issue surrounding the 2.4 I/O scheduler and traditional elevator algorithms in general. As discussed, the Linus Elevator maintains the sorted list of pending I/O requests in a single queue. The I/O request at the head of the queue is the next one to be serviced. The Deadline I/O Scheduler continues to keep this queue, but kicks things up a notch by introducing two additional queues—the read FIFO queue and the write FIFO queue. The Deadline I/O Scheduler keeps the items in each of these queues sorted by submission time (effectively, first in is first out). The read FIFO queue, as its name suggests, contains only read requests. The write FIFO queue, likewise, contains only write requests. Each FIFO queue is assigned an expiration value. The read FIFO queue has an expiration time of 500 milliseconds. The write FIFO queue has an expiration time of five seconds.
When a new I/O request is submitted, it is insertion-sorted into the standard queue and placed at the tail of its respective (either read or write) FIFO queue. Normally, the hard drive is sent I/O requests from the head of the standard sorted queue. This maximizes global throughput by minimizing seeks, as the normal queue is sorted by block number, as with the Linus Elevator.
When the item at the head of one of the FIFO queues, however, grows older than the expiration value associated with its queue, the I/O scheduler stops dispatching I/O requests from the standard queue. Instead, it services the I/O request at the head of the FIFO queue, plus a couple extra for good measure. The I/O scheduler needs to check and handle only the requests at the head of the FIFO queues, as those are the oldest requests in the queue.
Remember our old friend, the request to block 700? Despite the flood of write requests to the faraway blocks, after 500 milliseconds the Deadline I/O Scheduler would stop servicing those requests and service the read over at block 700. The disk would seek to block 700, service the read request and then continue servicing the other outstanding requests.
In this manner, the Deadline I/O Scheduler can enforce a soft deadline on I/O requests. Although it makes no promise that an I/O request is serviced before the expiration time, the I/O scheduler generally services requests near their expiration times. Thus, the Deadline I/O Scheduler continues to provide good global throughput without starving any one request for an unacceptably long time. Because read requests are given short expiration times, the writes-starving-reads problem is minimized.
This is all well and good, but it's not a perfect solution. Consider what happens with our fictional request to block 700, which presumably is the first of many dependent reads to that area of the disk. After servicing the read request, the Deadline I/O Scheduler continues servicing the write requests to the earlier blocks. This is fine, until the reading application submits its next read request (say, to block 710). In 500 milliseconds, that request expires and the disk seeks over to block 710, services the request, seeks back to where it was before and continues servicing the streaming write. And then another read arrives.
The problem again stems from those darn dependent reads. Because reads are issued in dependent chunks, the application issues the next read only when the previous is returned. But by the time the application receives the read data, is scheduled to run and submits the next read, the I/O scheduler has moved on and begun servicing some other requests. This results in a wasted pair of seeks for each read: seek to the read, service it and seek back. If only there was some way for the I/O scheduler to know—nay, to anticipate—that another read would soon be submitted to the same part of the disk. Instead of seeking back and forth, it could wait in anticipation for the next read. Saving those awful seeks certainly is worth a few milliseconds of waiting; we save two seeks.
This is, of course, exactly what the Anticipatory I/O Scheduler does. It began as the Deadline I/O Scheduler; it implements the same deadline-based scheduling. But it was gifted with the addition of an anticipation mechanism. When a read request is submitted, the Anticipatory I/O Scheduler services it within its deadline, as usual. Unlike the Deadline I/O Scheduler, however, the Anticipatory I/O Scheduler then sits and waits, doing nothing, for up to six milliseconds. Chances are good that the application will issue another read to the same part of the filesystem during those six milliseconds. If so, that request is serviced immediately, and the Anticipatory I/O Scheduler waits some more. If six milliseconds go by without a read request, the Anticipatory I/O Scheduler guessed wrong and returns to whatever it was doing before.
If even a moderate number of requests are anticipated correctly, a great deal of time (two expensive seeks, each) is saved (Table 1). Because most reads are dependent, the anticipation pays off most of the time. To further improve the odds of a correct anticipation, the Anticipatory I/O Scheduler uses a heuristic to better guess for which processes to wait. To this end, the scheduler maintains I/O statistics about each process to keep track of its behavior. Because of these statistics and intelligent heuristics, the Anticipatory I/O Scheduler correctly anticipates the actions of applications a sufficiently large amount of the time to be well worth the overhead.
Table 1. The Results
| I/O Scheduler and Kernel | Test 1 | Test 2 |
|---|---|---|
| Linus Elevator on 2.4 | 45 seconds | 30 minutes, 28 seconds |
| Deadline I/O Scheduler on 2.6 | 40 seconds | 3 minutes, 30 seconds |
| Anticipatory I/O Scheduler on 2.6 | 4.6 seconds | 15 seconds |
By minimizing unneeded seeks and more quickly servicing read requests, in many workloads the Anticipatory I/O Scheduler provides both improved request latency and global throughput over the Deadline I/O Scheduler and the Linus Elevator. Unsurprisingly, the Anticipatory I/O Scheduler is the default I/O scheduler in the 2.6 kernel. And rightly so, it rocks.
Today’s modular x86 servers are compute-centric, designed as a least common denominator to support a wide range of IT workloads. Those generic, virtualized IT workloads have much different resource optimization requirements than hyperscale and cloud applications. They have resulted in a “one size fits all” enterprise IT architecture that is not optimized for a specific set of IT workloads, and especially not emerging hyperscale workloads, such as web applications, big data, and object storage. In this report, you will learn how shifting the focus from traditional compute-centric IT architectures to an innovative disaggregated fabric-based architecture can optimize and scale your data center.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
| Non-Linux FOSS: Seashore | May 10, 2013 |
| Trying to Tame the Tablet | May 08, 2013 |
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Using Salt Stack and Vagrant for Drupal Development
- New Products
- Validate an E-Mail Address with PHP, the Right Way
- Drupal Is a Framework: Why Everyone Needs to Understand This
- A Topic for Discussion - Open Source Feature-Richness?
- Home, My Backup Data Center
- New Products
- RSS Feeds
- Tech Tip: Really Simple HTTP Server with Python
- Epistle
1 hour 16 min ago - Automatically updating Guest Additions
2 hours 25 min ago - I like your topic on android
3 hours 11 min ago - Reply to comment | Linux Journal
3 hours 33 min ago - This is the easiest tutorial
9 hours 47 min ago - Ahh, the Koolaid.
15 hours 26 min ago - git-annex assistant
21 hours 25 min ago - direct cable connection
21 hours 48 min ago - Agreed on AirDroid. With my
21 hours 58 min ago - I just learned this
22 hours 2 min ago
Enter to Win an Adafruit Prototyping Pi Plate Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Prototyping Pi Plate Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- Next winner announced on 5-21-13!
Free Webinar: Linux Backup and Recovery
Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.
In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.




Comments
Very informative read.
Thanks for taking the time to publish this article.
I'm currently studying operating system fundamentals at university and it was great to read on it's usage in the Linux operating system. I had learned how Linux treats all hardware items as 'files', but reading this has given me a much more in-depth understanding.
*Bookmarked*
great article
Hi this is grate, is there any other documents like this plain and symple
Thanks
Best regards
Nice article
I like this. It describes the problem to the point and nothing else. Thanks for writing it.
Re: Kernel Korner: I/O Schedulers
Elementary my dear Watson.
That's only a mere 122 fold increase in test two.
Now for your next trick.......
Mick.
What about driver improvements?
Your discounting the fact that there are many other performance enhancing improvements to the 2.6 kernel that have highly boosted read/write performance. Especially in the form of better supported IDE/SCSI devices and better drivers and feature sets.
I agree that anticipatory scheduler is MUCH better than the older schedulers, but recognize hardware/software improvements as well.
Anticipatory is not the best for everyone also, I recommend any person or company to run many many tests that will emulate your environment to test which scheduler is best for you...
-Tom
Nice article. However, I
Nice article. However, I agree that schedulers needs to be tested in a particular environment to be sure which choice is best.
For example, AS is a poor choice for RAID and/or virtual machines. Virtual hosts with RAID disks appear to work best with the deadline elevator. VM guests should use the noop elevator so not to disguise physical access with virtual characteristics, let alone the ineffective and unnecessary overhead of using AS in a VM guest.
-shubes