Kernel Korner - I/O Schedulers

Here's how I/O schedulers contribute to disk performance in Linux and the improvements you can get from the new I/O schedulers in the 2.6 kernel.
The Deadline I/O Scheduler

Preventing the starvation of requests in general, and read requests in particular, was a goal of the new 2.6 I/O schedulers.

The Deadline I/O Scheduler was introduced to solve the starvation issue surrounding the 2.4 I/O scheduler and traditional elevator algorithms in general. As discussed, the Linus Elevator maintains the sorted list of pending I/O requests in a single queue. The I/O request at the head of the queue is the next one to be serviced. The Deadline I/O Scheduler continues to keep this queue, but kicks things up a notch by introducing two additional queues—the read FIFO queue and the write FIFO queue. The Deadline I/O Scheduler keeps the items in each of these queues sorted by submission time (effectively, first in is first out). The read FIFO queue, as its name suggests, contains only read requests. The write FIFO queue, likewise, contains only write requests. Each FIFO queue is assigned an expiration value. The read FIFO queue has an expiration time of 500 milliseconds. The write FIFO queue has an expiration time of five seconds.

When a new I/O request is submitted, it is insertion-sorted into the standard queue and placed at the tail of its respective (either read or write) FIFO queue. Normally, the hard drive is sent I/O requests from the head of the standard sorted queue. This maximizes global throughput by minimizing seeks, as the normal queue is sorted by block number, as with the Linus Elevator.

When the item at the head of one of the FIFO queues, however, grows older than the expiration value associated with its queue, the I/O scheduler stops dispatching I/O requests from the standard queue. Instead, it services the I/O request at the head of the FIFO queue, plus a couple extra for good measure. The I/O scheduler needs to check and handle only the requests at the head of the FIFO queues, as those are the oldest requests in the queue.

Remember our old friend, the request to block 700? Despite the flood of write requests to the faraway blocks, after 500 milliseconds the Deadline I/O Scheduler would stop servicing those requests and service the read over at block 700. The disk would seek to block 700, service the read request and then continue servicing the other outstanding requests.

In this manner, the Deadline I/O Scheduler can enforce a soft deadline on I/O requests. Although it makes no promise that an I/O request is serviced before the expiration time, the I/O scheduler generally services requests near their expiration times. Thus, the Deadline I/O Scheduler continues to provide good global throughput without starving any one request for an unacceptably long time. Because read requests are given short expiration times, the writes-starving-reads problem is minimized.

Anticipatory I/O Scheduler

This is all well and good, but it's not a perfect solution. Consider what happens with our fictional request to block 700, which presumably is the first of many dependent reads to that area of the disk. After servicing the read request, the Deadline I/O Scheduler continues servicing the write requests to the earlier blocks. This is fine, until the reading application submits its next read request (say, to block 710). In 500 milliseconds, that request expires and the disk seeks over to block 710, services the request, seeks back to where it was before and continues servicing the streaming write. And then another read arrives.

The problem again stems from those darn dependent reads. Because reads are issued in dependent chunks, the application issues the next read only when the previous is returned. But by the time the application receives the read data, is scheduled to run and submits the next read, the I/O scheduler has moved on and begun servicing some other requests. This results in a wasted pair of seeks for each read: seek to the read, service it and seek back. If only there was some way for the I/O scheduler to know—nay, to anticipate—that another read would soon be submitted to the same part of the disk. Instead of seeking back and forth, it could wait in anticipation for the next read. Saving those awful seeks certainly is worth a few milliseconds of waiting; we save two seeks.

This is, of course, exactly what the Anticipatory I/O Scheduler does. It began as the Deadline I/O Scheduler; it implements the same deadline-based scheduling. But it was gifted with the addition of an anticipation mechanism. When a read request is submitted, the Anticipatory I/O Scheduler services it within its deadline, as usual. Unlike the Deadline I/O Scheduler, however, the Anticipatory I/O Scheduler then sits and waits, doing nothing, for up to six milliseconds. Chances are good that the application will issue another read to the same part of the filesystem during those six milliseconds. If so, that request is serviced immediately, and the Anticipatory I/O Scheduler waits some more. If six milliseconds go by without a read request, the Anticipatory I/O Scheduler guessed wrong and returns to whatever it was doing before.

If even a moderate number of requests are anticipated correctly, a great deal of time (two expensive seeks, each) is saved (Table 1). Because most reads are dependent, the anticipation pays off most of the time. To further improve the odds of a correct anticipation, the Anticipatory I/O Scheduler uses a heuristic to better guess for which processes to wait. To this end, the scheduler maintains I/O statistics about each process to keep track of its behavior. Because of these statistics and intelligent heuristics, the Anticipatory I/O Scheduler correctly anticipates the actions of applications a sufficiently large amount of the time to be well worth the overhead.

Table 1. The Results

I/O Scheduler and KernelTest 1Test 2
Linus Elevator on 2.445 seconds30 minutes, 28 seconds
Deadline I/O Scheduler on 2.640 seconds3 minutes, 30 seconds
Anticipatory I/O Scheduler on 2.64.6 seconds15 seconds

By minimizing unneeded seeks and more quickly servicing read requests, in many workloads the Anticipatory I/O Scheduler provides both improved request latency and global throughput over the Deadline I/O Scheduler and the Linus Elevator. Unsurprisingly, the Anticipatory I/O Scheduler is the default I/O scheduler in the 2.6 kernel. And rightly so, it rocks.



Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Very informative read.

Binary Soldier's picture

Thanks for taking the time to publish this article.

I'm currently studying operating system fundamentals at university and it was great to read on it's usage in the Linux operating system. I had learned how Linux treats all hardware items as 'files', but reading this has given me a much more in-depth understanding.


great article

procfs's picture

Hi this is grate, is there any other documents like this plain and symple


Best regards

Nice article

Michael R. Hines's picture

I like this. It describes the problem to the point and nothing else. Thanks for writing it.

Re: Kernel Korner: I/O Schedulers

Anonymous's picture

Elementary my dear Watson.

That's only a mere 122 fold increase in test two.

Now for your next trick.......


What about driver improvements?

Tom Callahan's picture

Your discounting the fact that there are many other performance enhancing improvements to the 2.6 kernel that have highly boosted read/write performance. Especially in the form of better supported IDE/SCSI devices and better drivers and feature sets.

I agree that anticipatory scheduler is MUCH better than the older schedulers, but recognize hardware/software improvements as well.

Anticipatory is not the best for everyone also, I recommend any person or company to run many many tests that will emulate your environment to test which scheduler is best for you...


Nice article. However, I

Anonymous's picture

Nice article. However, I agree that schedulers needs to be tested in a particular environment to be sure which choice is best.

For example, AS is a poor choice for RAID and/or virtual machines. Virtual hosts with RAID disks appear to work best with the deadline elevator. VM guests should use the noop elevator so not to disguise physical access with virtual characteristics, let alone the ineffective and unnecessary overhead of using AS in a VM guest.