Kernel Korner - I/O Schedulers
The I/O scheduler found in the 2.4 Linux kernel is named the Linus Elevator. I/O schedulers often are called elevator algorithms, because they tackle a problem similar to that of keeping an elevator moving smoothly in a large building. The Linus Elevator functions almost exactly like the classic I/O scheduler described above. For the most part, this was great because simplicity is a good thing and the 2.4 kernel's I/O scheduler just worked. Unfortunately, in the I/O scheduler's quest to maximize global I/O throughput, a trade-off was made: local fairness—in particular, request latency—can go easily out the window. Let's look at an example.
Consider a stream of requests to logical disk blocks such as 20, 30, 700 and 25. The I/O scheduler's sorting algorithm would queue and issue the requests in the following order (assuming the disk head currently is at the logical start of the disk): 20, 25, 30 and 700. This is expected and indeed correct. Assume, however, that halfway through servicing the request to block 25, another request comes in to the same part of the disk. And then another. And yet another. It is entirely feasible that the request way over to block 700 is not serviced for a relatively long time.
Worse, what if the request was to read a disk block? Read requests generally are synchronous. When an application issues a request to read some data, it typically blocks and waits until the kernel returns the data. The application must sit and wait, twiddling its thumbs, until that request way over at block 700 finally is serviced. Writes, on the other hand, typically are not synchronous—they are asynchronous. When an application issues a write, the kernel copies the data and metadata into the kernel, prepares a buffer to hold the data and returns to the application. The application does not really care or even know when the data actually hits the disk.
It gets worse for our friend the read request, however. Because writes are asynchronous, writes tend to stream. That is, it is common for a large writeback of a lot of data to occur. This implies that many individual write requests are submitted to a close area of the hard disk. As an example, consider saving a large file. The application dumps write requests on the system and hard drive as fast as it is scheduled.
Read requests, conversely, usually do not stream. Instead, applications submit read requests in small one-by-one chunks, with each chunk dependent on the last. Consider reading all of the files in a directory. The application opens the first file, issues a read request for a suitable chunk of the file, waits for the returned data, issues a read request for the next chunk, waits and continues likewise until the entire file is read. Then the file is closed, the next file is opened and the process repeats. Each subsequent request has to wait for the previous, which means substantial delays to this application if the requests are to far-off disk blocks. The phenomenon of streaming write requests starving dependent read requests is called writes-starving-reads (see Sidebar “Test 1. Writes-Starving-Reads”).
Test 1. Writes-Starving-Reads
In the background, perform a streaming write, such as:
while true
do
dd if=/dev/zero of=file bs=1M
done
Meanwhile, time how long a simple read of a 200MB file takes:
time cat 200mb-file > /dev/null
This test demonstrates the writes-starving-reads problem.
The possibility of not servicing some requests in a reasonable amount of time is called starvation. Request starvation results in unfairness. In the case of I/O schedulers, the system explicitly has decided to trade fairness for improved global throughput. That is, the system attempts to improve the overall performance of the system at the possible expense of any one specific I/O request. This is accepted and, indeed, desired—except that prolonged starvation is bad. Starving read requests for even moderate lengths of time results in high application latency for applications issuing read requests during other disk activity. This high read latency adversely affects system performance and feel (see Sidebar “Test 2. Effects of High Read Latency”).
Test 2. Effects of High Read Latency
Start a streaming read in the background:
while true
do
cat big-file > /dev/null
done
Meanwhile, measure how long it takes for a read of every file in the kernel source tree to complete:
time find . -type f -exec cat '{}' ';' > /dev/null
This measures the performance of a series of small dependent reads during a large streaming read.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Designing Electronics with Linux | May 22, 2013 |
| Dynamic DNS—an Object Lesson in Problem Solving | May 21, 2013 |
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
- RSS Feeds
- Dynamic DNS—an Object Lesson in Problem Solving
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Designing Electronics with Linux
- Using Salt Stack and Vagrant for Drupal Development
- New Products
- A Topic for Discussion - Open Source Feature-Richness?
- Drupal Is a Framework: Why Everyone Needs to Understand This
- Validate an E-Mail Address with PHP, the Right Way
- What's the tweeting protocol?
- Kernel Problem
9 hours 4 min ago - BASH script to log IPs on public web server
13 hours 31 min ago - DynDNS
17 hours 7 min ago - Reply to comment | Linux Journal
17 hours 39 min ago - All the articles you talked
20 hours 3 min ago - All the articles you talked
20 hours 6 min ago - All the articles you talked
20 hours 7 min ago - myip
1 day 32 min ago - Keeping track of IP address
1 day 2 hours ago - Roll your own dynamic dns
1 day 7 hours ago
Enter to Win an Adafruit Pi Cobbler Breakout Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Pi Cobbler Breakout Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- 5-21-13, Prototyping Pi Plate Kit: Philip Kirby
- Next winner announced on 5-27-13!
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?




Comments
Very informative read.
Thanks for taking the time to publish this article.
I'm currently studying operating system fundamentals at university and it was great to read on it's usage in the Linux operating system. I had learned how Linux treats all hardware items as 'files', but reading this has given me a much more in-depth understanding.
*Bookmarked*
great article
Hi this is grate, is there any other documents like this plain and symple
Thanks
Best regards
Nice article
I like this. It describes the problem to the point and nothing else. Thanks for writing it.
Re: Kernel Korner: I/O Schedulers
Elementary my dear Watson.
That's only a mere 122 fold increase in test two.
Now for your next trick.......
Mick.
What about driver improvements?
Your discounting the fact that there are many other performance enhancing improvements to the 2.6 kernel that have highly boosted read/write performance. Especially in the form of better supported IDE/SCSI devices and better drivers and feature sets.
I agree that anticipatory scheduler is MUCH better than the older schedulers, but recognize hardware/software improvements as well.
Anticipatory is not the best for everyone also, I recommend any person or company to run many many tests that will emulate your environment to test which scheduler is best for you...
-Tom
Nice article. However, I
Nice article. However, I agree that schedulers needs to be tested in a particular environment to be sure which choice is best.
For example, AS is a poor choice for RAID and/or virtual machines. Virtual hosts with RAID disks appear to work best with the deadline elevator. VM guests should use the noop elevator so not to disguise physical access with virtual characteristics, let alone the ineffective and unnecessary overhead of using AS in a VM guest.
-shubes