Kernel Korner - I/O Schedulers

Here's how I/O schedulers contribute to disk performance in Linux and the improvements you can get from the new I/O schedulers in the 2.6 kernel.
The Linus Elevator

The I/O scheduler found in the 2.4 Linux kernel is named the Linus Elevator. I/O schedulers often are called elevator algorithms, because they tackle a problem similar to that of keeping an elevator moving smoothly in a large building. The Linus Elevator functions almost exactly like the classic I/O scheduler described above. For the most part, this was great because simplicity is a good thing and the 2.4 kernel's I/O scheduler just worked. Unfortunately, in the I/O scheduler's quest to maximize global I/O throughput, a trade-off was made: local fairness—in particular, request latency—can go easily out the window. Let's look at an example.

Consider a stream of requests to logical disk blocks such as 20, 30, 700 and 25. The I/O scheduler's sorting algorithm would queue and issue the requests in the following order (assuming the disk head currently is at the logical start of the disk): 20, 25, 30 and 700. This is expected and indeed correct. Assume, however, that halfway through servicing the request to block 25, another request comes in to the same part of the disk. And then another. And yet another. It is entirely feasible that the request way over to block 700 is not serviced for a relatively long time.
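The sorted-insertion behavior above can be sketched as a toy model (plain Python for illustration, not kernel code; the queue here is just a sorted list of block numbers):

```python
import bisect

# Toy model of an elevator-style queue: requests are kept sorted by
# block number, and the head of the queue is serviced next.
queue = []
for block in (20, 30, 700, 25):
    bisect.insort(queue, block)        # sorted insertion, as the elevator does

assert queue == [20, 25, 30, 700]      # the order the scheduler would issue

# Now keep submitting requests near the head while servicing: the lone
# request at block 700 sinks to the back of the queue and waits.
serviced = []
incoming = iter((26, 27, 28))          # new nearby requests arriving mid-stream
while queue:
    serviced.append(queue.pop(0))
    nearby = next(incoming, None)
    if nearby is not None:
        bisect.insort(queue, nearby)

print(serviced)                        # [20, 25, 26, 27, 28, 30, 700]
```

With a steady stream of nearby requests instead of only three, the request at block 700 would wait indefinitely, which is exactly the starvation described next.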

Worse, what if the request was to read a disk block? Read requests generally are synchronous. When an application issues a request to read some data, it typically blocks and waits until the kernel returns the data. The application must sit and wait, twiddling its thumbs, until that request way over at block 700 finally is serviced. Writes, on the other hand, typically are not synchronous; they are asynchronous. When an application issues a write, the kernel copies the data and metadata into kernel space, prepares a buffer to hold the data and returns to the application. The application does not really care or even know when the data actually hits the disk.
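The asymmetry is visible from user space (a minimal Python sketch; the temporary file simply stands in for any regular file):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()

# The write returns as soon as the kernel has copied the data into its
# own buffers; nothing forces it to the platter yet.
os.write(fd, b"hello")

# Only an explicit fsync() blocks until the data reaches the disk.
os.fsync(fd)

# A read, by contrast, cannot return early: the application blocks
# until the kernel hands back the requested bytes.
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 5)

os.close(fd)
os.unlink(path)
print(data)                            # b'hello'
```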

It gets worse for our friend the read request, however. Because writes are asynchronous, writes tend to stream. That is, it is common for a large writeback of a lot of data to occur. This implies that many individual write requests are submitted to a close area of the hard disk. As an example, consider saving a large file. The application dumps write requests on the system as fast as they can be scheduled.

Read requests, conversely, usually do not stream. Instead, applications submit read requests in small one-by-one chunks, with each chunk dependent on the last. Consider reading all of the files in a directory. The application opens the first file, issues a read request for a suitable chunk of the file, waits for the returned data, issues a read request for the next chunk, waits and continues likewise until the entire file is read. Then the file is closed, the next file is opened and the process repeats. Each subsequent request has to wait for the previous, which means substantial delays to this application if the requests are to far-off disk blocks. The phenomenon of streaming write requests starving dependent read requests is called writes-starving-reads (see Sidebar “Test 1. Writes-Starving-Reads”).
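The dependent, one-chunk-at-a-time pattern looks roughly like this (a sketch; `read_in_chunks` is an invented helper, and `io.BytesIO` stands in for a real on-disk file):

```python
import io

def read_in_chunks(f, chunk_size=4096):
    """Read a file sequentially; each read depends on the previous one."""
    chunks = []
    while True:
        chunk = f.read(chunk_size)     # synchronous: blocks until data arrives
        if not chunk:                  # EOF ends the dependent chain
            break
        chunks.append(chunk)
    return b"".join(chunks)

# Every chunk must be returned before the next one is even requested,
# so a single starved read stalls the entire sequence behind it.
data = read_in_chunks(io.BytesIO(b"x" * 10000))
print(len(data))                       # 10000
```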

The possibility of not servicing some requests in a reasonable amount of time is called starvation. Request starvation results in unfairness. In the case of I/O schedulers, the system explicitly has decided to trade fairness for improved global throughput. That is, the system attempts to improve the overall performance of the system at the possible expense of any one specific I/O request. This is accepted and, indeed, desired—except that prolonged starvation is bad. Starving read requests for even moderate lengths of time results in high application latency for applications issuing read requests during other disk activity. This high read latency adversely affects system performance and feel (see Sidebar “Test 2. Effects of High Read Latency”).




Very informative read.

Binary Soldier

Thanks for taking the time to publish this article.

I'm currently studying operating system fundamentals at university, and it was great to read about its usage in the Linux operating system. I had learned how Linux treats all hardware items as 'files', but reading this has given me a much more in-depth understanding.


Great article

procfs

Hi, this is great. Are there any other documents like this, plain and simple?


Best regards

Nice article

Michael R. Hines

I like this. It describes the problem to the point and nothing else. Thanks for writing it.

Re: Kernel Korner: I/O Schedulers

Anonymous

Elementary, my dear Watson.

That's a mere 122-fold increase in test two.

Now for your next trick...


What about driver improvements?

Tom Callahan

You're discounting the fact that there are many other performance-enhancing improvements in the 2.6 kernel that have greatly boosted read/write performance, especially better-supported IDE/SCSI devices and better drivers and feature sets.

I agree that the anticipatory scheduler is MUCH better than the older schedulers, but recognize hardware/software improvements as well.

The anticipatory scheduler is not the best for everyone, either. I recommend that any person or company run many tests that emulate their environment to determine which scheduler is best for them...


Nice article. However, I

Anonymous

Nice article. However, I agree that schedulers need to be tested in a particular environment to be sure which choice is best.

For example, AS is a poor choice for RAID and/or virtual machines. Virtual hosts with RAID disks appear to work best with the deadline elevator. VM guests should use the noop elevator so as not to disguise physical access with virtual characteristics, let alone incur the ineffective and unnecessary overhead of using AS in a VM guest.
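For reference, a 2.6 kernel exposes the elevator per device through sysfs (a config-style sketch; the device name `sda` is an assumption, and runtime switching requires a kernel recent enough to expose the `scheduler` file, roughly 2.6.10 and later):

```shell
# Show the available elevators; the active one appears in brackets.
cat /sys/block/sda/queue/scheduler
# noop anticipatory deadline [cfq]    <- example output; varies by kernel

# Switch a VM guest's disk to the noop elevator at runtime.
echo noop > /sys/block/sda/queue/scheduler

# Or pick a default for all disks at boot via the kernel command line:
#   elevator=deadline
```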