Block Device Drivers: Optimization
Last month, I ran out of space for this column just as I was about to start a long discussion on optimizations. In the interest of usefulness, I will mention only the most useful optimizations here before going on to discuss initialization.
I won't have any sample code implementations of the optimizations, because these are complex issues that need to be handled in device-specific ways, and just as the code was much more vague when I discussed interrupt-driven drivers than when I introduced the basic, initial device driver, it would either be so large as to take over the entire magazine or be so vague as to be completely useless if I were to try to write some here. I hope that my explanations are useful to you even without code.
Also, I should warn you that the optimizations I talk about, while representative of common optimizations, are not necessarily representative of anything you will find in the Linux source code, except where I explicitly state otherwise. I'm writing on a slightly more theoretical level—about what can be done, rather than what has been done. Things conceptually similar to what I write about have been done, but the details are my own and may not be the best way to go about things. As usual, use this column as an introduction to the kernel source code and read the actual source code for far more insight than I can give you here.
If you are code-starved and don't care about optimizations, jump right to the second part of this article, where I talk about initialization.
One common optimization is coalescing adjacent requests. This means that when the driver is notified or notices that a request has been added to the request queue, it looks through the request list to see if there is a request for the next block (and possibly more blocks beyond that). If so, it sends a request to the hardware to read more than one block with one command, and when the data comes in from the hardware (presumably in an interrupt service routine), it fills both requests before calling end_request(1) (actually, some similar function designed especially for that driver) on either request. After satisfying (or failing to satisfy) the requests, the equivalent of end_request() is called for each request, but without waking up processes waiting on either request until the interrupt has been satisfied.
This will require that you write your own version of end_request(). Although this probably sounds daunting, it isn't as hard as it sounds, because you can use almost all of it as-is. For example, you could copy it verbatim, except instead of doing wake_up(&wait_for_request) at the end, you could add wait_for_request to a list of events to wake up when you are ready. Then you would call this new almost_end_request() function as soon as you have finished processing each request. When you are done handling the entire interrupt and are ready to wake up processes, iterate over the list of events, calling wake_up() on each in turn, from first satisfied to last satisfied.
Note that wake_up() will not cause a context switch directly. The driver will not give up control while running wake_up() to a process being woken up. Instead, wake_up() makes all the processes being woken up “runnable”, and sets the need_resched flag. This flag says that the scheduler ought to be called at the next convenient time, such as when returning from a “slow” interrupt handling routine (including the clock handling routine) or when returning from a system call. This means that the driver will not be pre-empted by calling wake_up(), and so it will be able to wake up all the necessary processes without being pre-empted.
This will likely take several tries to get right. All I can say to help is “Make sure you have backups. Really.”
The only driver in the Linux kernel that I have noticed doing anything like this is the floppy driver; the track buffer works in a similar way, where more than one request may be satisfied by a single read command sent to the hardware. If you are interested in investigating how it works, read drivers/block/floppy.c and search for floppy_track_buffer and read the entire function make_raw_rw_request().
Sounds like a “boondoggle”, doesn't it? Scatter-gather is perhaps a little bit similar in concept to coalescing adjacent requests, but is used with more intelligent hardware, and is perhaps a bit easier to implement. The “scatter” part means that when there are multiple blocks to be written all over a disk (for example), one command is sent out to initiate writing to all those different sectors, reducing the overhead involved in negotiation from O(n) to O(1), where n is the number of blocks or sectors to write.
Similarly, the “gather” part means that when there are multiple blocks to be read, one command is sent out to initiate reading all the blocks, and as the disk sends in each block, the corresponding request is marked as satisfied with end_request(1) or equivalent device-specific code. You will only be able to easily use end_request() unmodified with scatter-gather if each block read or written results in a separate interrupt being generated, and perhaps not even then. The SCSI driver does its own, which is probably the best way to go.
If you want to increase throughput, at the slight expense of response time, you could use timers to help: when your request() is notified that there is a request, and sees that there is only one request outstanding, it could set a timer to go off soon (one or two tenths of a second, perhaps), assuming that while waiting, more requests will spill in to be dealt with, and that when a certain number of requests have been made, or the timer has gone off, whichever comes first, scatter-gather will be applied to the blocks. If the request() routine is called and notices that “enough” (however many that is...) requests have accumulated, it would un-install the timer and process the requests. If the timer were to go off, all requests would be processed.
Note that the timer used should not be the same static timer used for the hardware timeout. Instead, it should be a dynamically allocated timer. See <linux/timer.h> for details on the dynamic timer routines.
I will repeat my standard disclaimer: this is simplified (at least, I'm trying to simplify it...) and if you want more detailed and correct information, study a real driver. The SCSI high-level drivers (scsi.c, sd.c, sr.c) are definitely the place to start. (I don't mention st.c and sg.c because they are character, not block, devices.)
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Designing Electronics with Linux | May 22, 2013 |
| Dynamic DNS—an Object Lesson in Problem Solving | May 21, 2013 |
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
- RSS Feeds
- Dynamic DNS—an Object Lesson in Problem Solving
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Designing Electronics with Linux
- Using Salt Stack and Vagrant for Drupal Development
- New Products
- A Topic for Discussion - Open Source Feature-Richness?
- Drupal Is a Framework: Why Everyone Needs to Understand This
- Validate an E-Mail Address with PHP, the Right Way
- What's the tweeting protocol?
Enter to Win an Adafruit Pi Cobbler Breakout Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Pi Cobbler Breakout Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- 5-21-13, Prototyping Pi Plate Kit: Philip Kirby
- Next winner announced on 5-27-13!
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?




4 hours 50 min ago
9 hours 17 min ago
12 hours 52 min ago
13 hours 25 min ago
15 hours 48 min ago
15 hours 52 min ago
15 hours 53 min ago
20 hours 18 min ago
22 hours 9 min ago
1 day 3 hours ago