Block Device Drivers: Optimization
Last month, I ran out of space for this column just as I was about to start a long discussion on optimizations. To keep things practical, I will cover only the most useful optimizations here before going on to discuss initialization.
I won't give sample implementations of the optimizations, because these are complex issues that need to be handled in device-specific ways. The code was already much more vague when I discussed interrupt-driven drivers than when I introduced the basic, initial device driver. If I tried to write some here, it would be either so large as to take over the entire magazine or so vague as to be completely useless. I hope that my explanations are useful to you even without code.
Also, I should warn you that the optimizations I talk about, while representative of common optimizations, are not necessarily representative of anything you will find in the Linux source code, except where I explicitly state otherwise. I'm writing on a slightly more theoretical level—about what can be done, rather than what has been done. Things conceptually similar to what I write about have been done, but the details are my own and may not be the best way to go about things. As usual, use this column as an introduction to the kernel source code and read the actual source code for far more insight than I can give you here.
If you are code-starved and don't care about optimizations, jump right to the second part of this article, where I talk about initialization.
One common optimization is coalescing adjacent requests. When the driver is notified, or notices, that a request has been added to the request queue, it looks through the request list for a request for the next block (and possibly more blocks beyond that). If it finds one, it sends a single command to the hardware to read more than one block. When the data comes in from the hardware (presumably in an interrupt service routine), the driver fills both requests before calling end_request(1) (actually, a similar function written especially for that driver) on either one. Once the requests have been satisfied (or have failed), the equivalent of end_request() is called for each request, but processes waiting on either request are not woken until the interrupt has been completely handled.
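As a rough user-space illustration of the scan described above, the sketch below counts how many requests at the head of a queue ask for consecutive blocks. The struct toy_request type and the count_adjacent() function are hypothetical names invented for this example, not kernel code, and the sketch assumes the queue is already sorted by block number so that adjacent blocks sit next to each other in the list.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a kernel request: one block each. */
struct toy_request {
    unsigned long block;        /* block number requested */
    struct toy_request *next;   /* next request in the (sorted) queue */
};

/*
 * Count how many requests, starting at the head of the queue, form a run
 * of consecutive block numbers (b, b+1, b+2, ...), stopping at max_run.
 * A driver could then issue one hardware command covering that many blocks.
 */
size_t count_adjacent(const struct toy_request *head, size_t max_run)
{
    const struct toy_request *r = head;
    unsigned long expected;
    size_t run = 0;

    if (head == NULL || max_run == 0)
        return 0;

    expected = head->block;
    while (r != NULL && run < max_run && r->block == expected) {
        run++;
        expected++;
        r = r->next;
    }
    return run;
}
```

The max_run argument stands in for whatever limit the hardware places on a single multi-block command.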
This will require that you write your own version of end_request(). Although this probably sounds daunting, it isn't as hard as it sounds, because you can use almost all of it as-is. For example, you could copy it verbatim, except instead of doing wake_up(&wait_for_request) at the end, you could add wait_for_request to a list of events to wake up when you are ready. Then you would call this new almost_end_request() function as soon as you have finished processing each request. When you are done handling the entire interrupt and are ready to wake up processes, iterate over the list of events, calling wake_up() on each in turn, from first satisfied to last satisfied.
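The deferred wake-up idea might be modelled in user space roughly as follows. Everything here except the almost_end_request() name is an assumption made for illustration: struct toy_wait_queue and toy_queue_wake() are stand-ins for a real wait queue and wake_up(), and the fixed-size array is just the simplest possible "list of events".

```c
#include <stddef.h>

#define MAX_PENDING 16

/* Hypothetical stand-in for a kernel wait queue. */
struct toy_wait_queue {
    int woken;   /* 0 = still asleep, otherwise the order it was woken in */
};

static struct toy_wait_queue *pending[MAX_PENDING];
static size_t n_pending;
static int wake_counter;

/* Stand-in for wake_up(): only records the order of the wake-ups. */
static void toy_queue_wake(struct toy_wait_queue *q)
{
    q->woken = ++wake_counter;
}

/*
 * Does what end_request() would do, except that the wake-up is postponed:
 * the wait queue is remembered instead of being woken immediately.
 * Returns 0 on success, -1 if the pending list is full.
 */
int almost_end_request(struct toy_wait_queue *q)
{
    if (n_pending >= MAX_PENDING)
        return -1;
    pending[n_pending++] = q;
    return 0;
}

/* Called once the whole interrupt is handled: wake first satisfied first. */
void wake_all_pending(void)
{
    size_t i;

    for (i = 0; i < n_pending; i++)
        toy_queue_wake(pending[i]);
    n_pending = 0;
}
```

A real driver would also have to decide what to do when the pending list fills up; this sketch simply reports the failure to the caller.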
Note that wake_up() does not cause a context switch directly: the driver does not give up control to a process it is waking while wake_up() is running. Instead, wake_up() makes all the processes being woken up “runnable” and sets the need_resched flag. This flag says that the scheduler ought to be called at the next convenient time, such as when returning from a “slow” interrupt handling routine (including the clock handling routine) or when returning from a system call. The driver can therefore wake up all the necessary processes without being pre-empted partway through.
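A toy model may make the ordering clearer. It is not kernel code: model_wake_up(), model_return_from_interrupt() and the counters are invented for this sketch, which only mimics the behaviour described above (mark tasks runnable, set a flag, and let the scheduler run later).

```c
#include <stddef.h>

enum model_state { MODEL_SLEEPING, MODEL_RUNNABLE };

struct model_task {
    enum model_state state;
};

static int model_need_resched;       /* models the need_resched flag */
static int model_context_switches;   /* counts simulated scheduler runs */

/* Model of wake_up(): mark tasks runnable, ask for a reschedule, return. */
void model_wake_up(struct model_task **tasks, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        tasks[i]->state = MODEL_RUNNABLE;
    if (n > 0)
        model_need_resched = 1;   /* no context switch happens here */
}

/* Model of returning from a "slow" interrupt handler or a system call. */
void model_return_from_interrupt(void)
{
    if (model_need_resched) {
        model_need_resched = 0;
        model_context_switches++;   /* the scheduler would run at this point */
    }
}
```

In this model, any number of calls to model_wake_up() made inside one interrupt result in a single scheduler run when the handler returns.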
This will likely take several tries to get right. All I can say to help is “Make sure you have backups. Really.”
The only driver in the Linux kernel that I have noticed doing anything like this is the floppy driver; the track buffer works in a similar way, where more than one request may be satisfied by a single read command sent to the hardware. If you are interested in investigating how it works, read drivers/block/floppy.c, search for floppy_track_buffer, and read the entire make_raw_rw_request() function.
Scatter-gather sounds like a “boondoggle”, doesn't it? It is perhaps a little bit similar in concept to coalescing adjacent requests, but it is used with more intelligent hardware and is perhaps a bit easier to implement. The “scatter” part means that when there are multiple blocks to be written all over a disk (for example), one command is sent out to initiate writing to all those different sectors, reducing the overhead involved in negotiation from O(n) to O(1), where n is the number of blocks or sectors to write.
Similarly, the “gather” part means that when there are multiple blocks to be read, one command is sent out to initiate reading all the blocks, and as the disk sends in each block, the corresponding request is marked as satisfied with end_request(1) or equivalent device-specific code. You will only be able to easily use end_request() unmodified with scatter-gather if each block read or written results in a separate interrupt being generated, and perhaps not even then. The SCSI driver does its own, which is probably the best way to go.
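To make the "one command, many blocks" idea concrete, here is a small user-space sketch of the read side as described above. The sg_command and sg_segment types, the 4-byte block size, and the in-memory "disk" are all assumptions made for the example; real hardware and the real SCSI code look quite different.

```c
#include <stddef.h>
#include <string.h>

#define SG_BLOCK_SIZE 4   /* tiny block size, chosen only for the example */
#define SG_MAX_SEGS   8

/* One piece of a combined command: which sector, and where the data goes. */
struct sg_segment {
    unsigned long sector;
    unsigned char *buffer;
    int done;              /* set once this piece has been satisfied */
};

/* A single command covering several unrelated sectors. */
struct sg_command {
    struct sg_segment segs[SG_MAX_SEGS];
    size_t count;
};

/* Add one block to the command. Returns 0, or -1 if the command is full. */
int sg_add(struct sg_command *cmd, unsigned long sector, unsigned char *buffer)
{
    if (cmd->count >= SG_MAX_SEGS)
        return -1;
    cmd->segs[cmd->count].sector = sector;
    cmd->segs[cmd->count].buffer = buffer;
    cmd->segs[cmd->count].done = 0;
    cmd->count++;
    return 0;
}

/*
 * Pretend hardware: one call services every segment of the command.
 * Returns the number of segments that were read successfully.
 */
size_t sg_execute_read(struct sg_command *cmd,
                       const unsigned char *disk, size_t disk_sectors)
{
    size_t i, completed = 0;

    for (i = 0; i < cmd->count; i++) {
        struct sg_segment *s = &cmd->segs[i];

        if (s->sector >= disk_sectors)
            continue;   /* out of range: this piece stays unsatisfied */
        memcpy(s->buffer, disk + s->sector * SG_BLOCK_SIZE, SG_BLOCK_SIZE);
        s->done = 1;    /* a driver would call end_request(1) or its own equivalent here */
        completed++;
    }
    return completed;
}
```

The per-segment done flag marks the spot where each request would be completed individually as its block arrives.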
If you want to increase throughput at a slight cost in response time, timers can help. When your request() routine is called and sees only one request outstanding, it can set a timer to go off soon (one or two tenths of a second, perhaps), on the assumption that more requests will arrive in the meantime. Scatter-gather is then applied to the accumulated blocks when a certain number of requests have been made or when the timer goes off, whichever comes first. If request() is called and notices that “enough” (however many that is...) requests have accumulated, it removes the timer and processes the requests. If the timer goes off first, all the queued requests are processed.
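The "threshold or timeout, whichever comes first" logic can be sketched in user space with a simulated clock. The batcher structure, the threshold of 4 requests, and the delay of 20 ticks (0.2 seconds at an assumed 100 ticks per second) are all hypothetical values picked for illustration; a real driver would use the kernel's timer routines rather than polling a tick function.

```c
#define BATCH_THRESHOLD   4    /* "enough" requests, chosen arbitrarily */
#define BATCH_DELAY_TICKS 20   /* 0.2 s at an assumed 100 ticks per second */

struct batcher {
    int queued;              /* requests waiting to be processed */
    int timer_armed;         /* nonzero while the delay timer is pending */
    unsigned long expires;   /* tick at which the timer goes off */
    int flushes;             /* how many batches have been processed */
};

/* Process everything that is queued and cancel any pending timer. */
static void batch_flush(struct batcher *b)
{
    b->timer_armed = 0;
    if (b->queued > 0) {
        b->queued = 0;       /* scatter-gather would be applied here */
        b->flushes++;
    }
}

/* Model of request() being called at tick `now` with one new request. */
void batch_request(struct batcher *b, unsigned long now)
{
    b->queued++;
    if (b->queued >= BATCH_THRESHOLD) {
        batch_flush(b);                       /* enough work: go now */
    } else if (!b->timer_armed) {
        b->timer_armed = 1;                   /* first request: start waiting */
        b->expires = now + BATCH_DELAY_TICKS;
    }
}

/* Model of the timer: call once per clock tick. */
void batch_tick(struct batcher *b, unsigned long now)
{
    if (b->timer_armed && now >= b->expires)
        batch_flush(b);
}
```

Tuning the two constants trades response time against throughput: a longer delay or higher threshold gives larger batches but makes a lone request wait longer.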
Note that the timer used should not be the same static timer used for the hardware timeout. Instead, it should be a dynamically allocated timer. See <linux/timer.h> for details on the dynamic timer routines.
I will repeat my standard disclaimer: this is simplified (at least, I'm trying to simplify it...) and if you want more detailed and correct information, study a real driver. The SCSI high-level drivers (scsi.c, sd.c, sr.c) are definitely the place to start. (I don't mention st.c and sg.c because they are character, not block, devices.)