Kernel Korner - I/O Schedulers
Preventing the starvation of requests in general, and read requests in particular, was a goal of the new 2.6 I/O schedulers.
The Deadline I/O Scheduler was introduced to solve the starvation issue surrounding the 2.4 I/O scheduler and traditional elevator algorithms in general. As discussed, the Linus Elevator maintains the sorted list of pending I/O requests in a single queue. The I/O request at the head of the queue is the next one to be serviced. The Deadline I/O Scheduler continues to keep this queue, but kicks things up a notch by introducing two additional queues—the read FIFO queue and the write FIFO queue. The Deadline I/O Scheduler keeps the items in each of these queues sorted by submission time (effectively, first in is first out). The read FIFO queue, as its name suggests, contains only read requests. The write FIFO queue, likewise, contains only write requests. Each FIFO queue is assigned an expiration value. The read FIFO queue has an expiration time of 500 milliseconds. The write FIFO queue has an expiration time of five seconds.
When a new I/O request is submitted, it is insertion-sorted into the standard queue and placed at the tail of its respective (either read or write) FIFO queue. Normally, the hard drive is sent I/O requests from the head of the standard sorted queue. This maximizes global throughput by minimizing seeks, as the normal queue is sorted by block number, as with the Linus Elevator.
When the item at the head of one of the FIFO queues, however, grows older than the expiration value associated with its queue, the I/O scheduler stops dispatching I/O requests from the standard queue. Instead, it services the I/O request at the head of the FIFO queue, plus a couple extra for good measure. The I/O scheduler needs to check and handle only the requests at the head of the FIFO queues, as those are the oldest requests in the queue.
Remember our old friend, the request to block 700? Despite the flood of write requests to the faraway blocks, after 500 milliseconds the Deadline I/O Scheduler would stop servicing those requests and service the read over at block 700. The disk would seek to block 700, service the read request and then continue servicing the other outstanding requests.
In this manner, the Deadline I/O Scheduler can enforce a soft deadline on I/O requests. Although it makes no promise that an I/O request is serviced before the expiration time, the I/O scheduler generally services requests near their expiration times. Thus, the Deadline I/O Scheduler continues to provide good global throughput without starving any one request for an unacceptably long time. Because read requests are given short expiration times, the writes-starving-reads problem is minimized.
This is all well and good, but it's not a perfect solution. Consider what happens with our fictional request to block 700, which presumably is the first of many dependent reads to that area of the disk. After servicing the read request, the Deadline I/O Scheduler continues servicing the write requests to the earlier blocks. This is fine, until the reading application submits its next read request (say, to block 710). In 500 milliseconds, that request expires and the disk seeks over to block 710, services the request, seeks back to where it was before and continues servicing the streaming write. And then another read arrives.
The problem again stems from those darn dependent reads. Because reads are issued in dependent chunks, the application issues the next read only when the previous is returned. But by the time the application receives the read data, is scheduled to run and submits the next read, the I/O scheduler has moved on and begun servicing some other requests. This results in a wasted pair of seeks for each read: seek to the read, service it and seek back. If only there was some way for the I/O scheduler to know—nay, to anticipate—that another read would soon be submitted to the same part of the disk. Instead of seeking back and forth, it could wait in anticipation for the next read. Saving those awful seeks certainly is worth a few milliseconds of waiting; we save two seeks.
This is, of course, exactly what the Anticipatory I/O Scheduler does. It began as the Deadline I/O Scheduler; it implements the same deadline-based scheduling. But it was gifted with the addition of an anticipation mechanism. When a read request is submitted, the Anticipatory I/O Scheduler services it within its deadline, as usual. Unlike the Deadline I/O Scheduler, however, the Anticipatory I/O Scheduler then sits and waits, doing nothing, for up to six milliseconds. Chances are good that the application will issue another read to the same part of the filesystem during those six milliseconds. If so, that request is serviced immediately, and the Anticipatory I/O Scheduler waits some more. If six milliseconds go by without a read request, the Anticipatory I/O Scheduler guessed wrong and returns to whatever it was doing before.
If even a moderate number of requests are anticipated correctly, a great deal of time (two expensive seeks, each) is saved (Table 1). Because most reads are dependent, the anticipation pays off most of the time. To further improve the odds of a correct anticipation, the Anticipatory I/O Scheduler uses a heuristic to better guess for which processes to wait. To this end, the scheduler maintains I/O statistics about each process to keep track of its behavior. Because of these statistics and intelligent heuristics, the Anticipatory I/O Scheduler correctly anticipates the actions of applications a sufficiently large amount of the time to be well worth the overhead.
Table 1. The Results
|I/O Scheduler and Kernel||Test 1||Test 2|
|Linus Elevator on 2.4||45 seconds||30 minutes, 28 seconds|
|Deadline I/O Scheduler on 2.6||40 seconds||3 minutes, 30 seconds|
|Anticipatory I/O Scheduler on 2.6||4.6 seconds||15 seconds|
By minimizing unneeded seeks and more quickly servicing read requests, in many workloads the Anticipatory I/O Scheduler provides both improved request latency and global throughput over the Deadline I/O Scheduler and the Linus Elevator. Unsurprisingly, the Anticipatory I/O Scheduler is the default I/O scheduler in the 2.6 kernel. And rightly so, it rocks.
- Papa's Got a Brand New NAS
- Applied Expert Systems, Inc.'s CleverView for TCP/IP on Linux
- Panther MPC, Inc.'s Panther Alpha
- Simplenote, Simply Awesome!
- GENIVI Alliance's GENIVI Vehicle Simulator
- The Tiny Internet Project, Part III
- Rogue Wave Software's TotalView for HPC and CodeDynamics
- NethServer: Linux without All That Linux Stuff
- Jetico's BestCrypt Container Encryption for Linux
- Sometimes My Office Goes with Me