Introducing the 2.6 Kernel
The purpose of kernel preemption is to lower scheduling latency. The result is improved system response and interactive feel of the system. The Linux kernel became preemptive with version 2.5.4. Previously, kernel code executed cooperatively. This meant a process—even a real-time one—could not preempt another process executing a system call in the kernel. Consequently, a lower priority process could priority invert a higher priority process by denying it access to the processor when it requested it. Even if the lower priority process' timeslice expired, it would continue running until it completed its work in the kernel or voluntarily relinquished control. If the higher priority process waiting to run is a text editor in which the user is typing or an MP3 player ready to refill its audio buffer, the result is poor interactive performance. Worse, if the higher priority process is a specialized real-time process, the result could be catastrophic.
Why was the kernel not preemptive from the start? Because it is more work to provide a preemptive kernel. If tasks in the kernel can reschedule at any moment, protection must be in place to prevent concurrent access to shared data. Thankfully, the issues a preemptive kernel creates are identical to the concerns raised by symmetrical multiprocessing (SMP). The mechanisms that provide protection under SMP were adapted easily to provide protection with kernel preemption. Thus, the kernel simply leverages SMP spinlocks as preemption markers. When code would hold a lock, preemption is similarly disabled. Otherwise, it is safe to preempt the current task.
Most likely, one can now see the next bottleneck. The preemptive kernel simply reduces scheduling latency from the entire length of kernel execution to the duration of spinlocks. It's definitely shorter, sure, but it's still a potential problem. Thankfully, reducing lock duration, which is equal to the length of time kernel preemption is disabled, is doable.
Kernel developers optimized kernel algorithms for lower latency. They primarily concentrated on the VM and virtual filesystem (VFS) and, consequently, greatly reduced the lock duration. The result is excellent system response. Users have observed worst-case scheduling latency in 2.5, even on average machines, at less than 500 nanoseconds.
The block layer is the chunk of the kernel responsible for supporting block devices. Traditional UNIX systems support two general types of hardware devices, character devices and block devices. Character devices, such as serial ports and keyboards, manipulate data as a stream of characters, or bytes, one at a time. Conversely, block devices manipulate data in groups of a fixed size (called blocks). Block devices do not merely send or receive a stream of data; instead, any of their blocks are accessible. Moving to one block from another is called seeking. Examples of block devices include hard disks, CD-ROM drives and tape backup devices.
Managing block devices is a nontrivial job. Hard disks are complicated pieces of hardware for which the operating system needs to support arbitrary reading and writing of any valid block. Further, because seeks are expensive, the operating system must intelligently manage and queue requests to block devices to minimize seeks.
The block layer in Linux was in serious need of a redesign. Thankfully, starting with kernel 2.5.1, the revamp began. The most interesting work involved creating a new flexible and generic structure to represent block I/O requests, eliminating bounce buffers and supporting I/O directly into high memory, making the global io_request_lock per queue and building a new I/O scheduler.
Prior to 2.5, the block layer used the buffer_head structure to represent I/O requests. This method was inefficient for several reasons, the largest being the block layer often had to break the data structures into smaller chunks, only to reconstruct them later in the I/O scheduler. In 2.5, the kernel makes use of a new data structure, struct bio, to represent I/O. This structure is simpler, appropriate for both raw and buffered I/O, works with high memory and may be split and merged easily. The block layer consistently uses the new bio structure, resulting in cleaner, more efficient code.
The next issue was eliminating the bounce buffer used when performing I/O into high memory. In the 2.4 kernel, an I/O transfer from a block device into high memory has to make an unfortunate extra stop. High memory is non-permanently mapped memory for which the kernel must provide special support. On Intel x86 machines, this is any memory over about 1GB. Any I/O request into high memory (for example, reading a file from a hard drive into a memory address greater than 1GB) must make use of a special bounce buffer that resides in low memory. The rationale is that some devices may be unable to understand high memory addresses. The result is devices always must perform their I/O transfers into low memory. If the final destination is in fact high memory, the data must bounce from the block device to low memory and finally into high memory (Figure 2). This extra copy introduces significant overhead. The 2.5 kernel now automatically supports transferring directly into high memory, eliminating the bounce buffer logic for devices that are capable.
The next bottleneck that developers tackled was the global I/O request lock. Each block device is associated with a request queue, which stores block I/O requests, the individual bio structures that represent each block read or write. The kernel constantly updates the queues as drivers add or remove requests. The io_request_lock protects the queues from concurrent access—code may update a queue only while holding the lock. In kernels prior to 2.5, a single global lock protects all the request queues in the system. The global lock prevents concurrent access to any queue, and the lock merely needs to prevent concurrent access to any single queue. In 2.5, a fine-grained lock for each queue replaced the global request lock (Figure 3). Consequently, the kernel now can manipulate multiple queues at the same time.
Finally, a new I/O scheduler solved the remaining block layer inefficiency. The I/O scheduler is responsible for merging block requests and sending them to the physical devices. Because seeks are expensive, the I/O scheduler prefers to service contiguous requests. To this end, it sorts incoming requests by sector. This is an important feature for both disk performance and longevity. The problem is, however, that repeated I/O requests to contiguous sectors could prevent servicing of a request for a nonadjacent sector. The new I/O scheduler solves this problem by implementing deadlines for I/O requests. If the I/O scheduler starves a request past its deadline, the I/O scheduler services the starved request rather than continuing to merge requests at the current sector. The new I/O scheduler also solves the writes-starving-reads problem by giving preferential treatment to read requests over write requests. This change greatly improves read latency. Last but not least, the request queue is now a red/black tree, which is an easily searchable data structure, instead of a linear list.
Practical Task Scheduling Deployment
July 20, 2016 12:00 pm CDT
One of the best things about the UNIX environment (aside from being stable and efficient) is the vast array of software tools available to help you do your job. Traditionally, a UNIX tool does only one thing, but does that one thing very well. For example, grep is very easy to use and can search vast amounts of data quickly. The find tool can find a particular file or files based on all kinds of criteria. It's pretty easy to string these tools together to build even more powerful tools, such as a tool that finds all of the .log files in the /home directory and searches each one for a particular entry. This erector-set mentality allows UNIX system administrators to seem to always have the right tool for the job.
Cron traditionally has been considered another such a tool for job scheduling, but is it enough? This webinar considers that very question. The first part builds on a previous Geek Guide, Beyond Cron, and briefly describes how to know when it might be time to consider upgrading your job scheduling infrastructure. The second part presents an actual planning and implementation framework.
Join Linux Journal's Mike Diehl and Pat Cameron of Help Systems.
Free to Linux Journal readers.Register Now!
- SUSE LLC's SUSE Manager
- My +1 Sword of Productivity
- Murat Yener and Onur Dundar's Expert Android Studio (Wrox)
- Managing Linux Using Puppet
- Non-Linux FOSS: Caffeine!
- Tech Tip: Really Simple HTTP Server with Python
- Doing for User Space What We Did for Kernel Space
- Parsing an RSS News Feed with a Bash Script
- Rogue Wave Software's Zend Server
- SuperTuxKart 0.9.2 Released
With all the industry talk about the benefits of Linux on Power and all the performance advantages offered by its open architecture, you may be considering a move in that direction. If you are thinking about analytics, big data and cloud computing, you would be right to evaluate Power. The idea of using commodity x86 hardware and replacing it every three years is an outdated cost model. It doesn’t consider the total cost of ownership, and it doesn’t consider the advantage of real processing power, high-availability and multithreading like a demon.
This ebook takes a look at some of the practical applications of the Linux on Power platform and ways you might bring all the performance power of this open architecture to bear for your organization. There are no smoke and mirrors here—just hard, cold, empirical evidence provided by independent sources. I also consider some innovative ways Linux on Power will be used in the future.Get the Guide