Real-Time and Linux, Part 2: the Preemptible Kernel

by Kevin Dankwardt

In the January/February 2002 issue of Embedded Linux Journal, we examined the fundamental issues of real time with Linux. In this article we examine efforts to bring real-time capabilities to applications by making improvements to the Linux kernel. To date, the majority of this work has been to make the kernel more responsive--to reduce latency by reducing the preemption latency, which can be quite long in Linux.

By improving the kernel, and not changing or adding to the API, applications can run more responsively by merely switching out a standard kernel for the improved one. This is a big benefit. It means that ISVs need not create special versions for different real-time efforts. For example, DVD players may run more reliably on an improved kernel without needing to be aware that the kernel they are running on has been improved.

Background and History

With around Linux kernel release 2.2, the issue of kernel preemptibility began to get quite a lot of attention. Paul Barton-Davis and Benno Senoner, for example, wrote a letter (which in addition was signed by many others) to Linus Torvalds, asking that 2.4 please include significantly reduced preemption delays (http://lists.insecure.org/linux-kernel/2000/Jul/0123.html).

Their request was based on their desire to have Linux function well with audio, music and MIDI. Senoner produced some benchmarking software that demonstrated that the 2.2 kernel (and later the 2.4 kernel) had worst-case preemption latencies on the order of 100ms (http://www.gardena.net/benno/linux/audio/). Latencies of this magnitude are unacceptable for audio applications. Conventional wisdom seems to say that latencies on the order of no more than a few milliseconds are required.

Two efforts emerged that produced patched kernels that provided quite reasonable preemption latencies. Ingo Molnar (of Red Hat) and Andrew Morton (then of The University of Wollongong) both produced patch sets that provided preemption within particularly long sections in the kernel. You can find Ingo Molnar's patches at http://people.redhat.com/mingo/lowlatency-patches/, and you can find Andrew Morton's work at http://www.zipworld.com.au/~akpm/linux/schedlat.html.

In addition, Morton provides tools for measuring latencies, such as periods where the kernel ignores reschedule requests. His low-latency patches' web page, cited above, provides information on those as well.

Recently, at least two organizations have produced preemptible kernels that provide a more fundamental, and powerful, solution to the kernel preemptibility problem.

In the first article of this series in the January/February 2002 issue of ELJ, we listed several other desired features for real-time support in Linux, including increased number of priority levels, user-space interrupt handling and DMA, priority inheritance on synchronization mechanisms, microsecond time resolution, complete POSIX 1003.1b functionality and a constant time algorithm for scheduling. We will briefly comment on these as well.

A key point to remember with all of these improvements is that they involve patching the kernel. Anytime you patch a kernel you must assume that you no longer have binary compatibility for other kernel code, such as drivers. For example, the preemptible kernel approaches require modifying the code for spin locks. A binary driver won't employ this modification and thus may not prevent preemption properly. This emphasizes the need to have the source and recompile all kernel code. The Linux model for drivers is one of source-compatibility anyway. Distribution of binary-only drivers is discouraged for compatibility as well as for open-source philosophy reasons.

Improvements

Various efforts that improve the kernel provide essentially transparent benefits. The efforts to improve the preemptibility of the kernel, be they through a preemptible kernel or through preemption points, result in a kernel that is more responsive to applications without any alterations in these applications.

Another aspect of transparency is whether the changes are transparent to the kernel, or in other words, do the approaches automatically track with changes in the kernel. The preemption point approaches of Molnar and Morton require that the scheduling latencies in new kernels be measured and preemption points placed in the proper places.

In contrast, the approaches to creating a preemptible kernel piggyback on the SMP locking and thus automatically transfer with new kernel versions. Also, by tying the preemptibility to the SMP-locking mechanism, as kernel developers improve the granularity of the SMP locking, the granularity of the preemption will improve automatically as well. We are likely to see steady improvement in SMP-locking granularity because improvement in this is required for improved SMP scaling.

It is because of this co-opting of the SMP locks that the preemptible kernel work depends upon a 2.4 or newer kernel. Prior kernels lacked the required SMP locks.

Another important benefit of the preemptible kernel approach to emphasize is that the approach makes code, which is otherwise unaware of it, preemptible. For example, driver writers need do nothing special to have their driver preemptible. Code in the driver will be preempted as required unless the driver holds a lock. Thus, as in other parts of the kernel, well-written drivers that are SMP-safe automatically will benefit from a preemptible kernel. On the other hand, drivers that are not SMP-safe may not function correctly with the preemptible kernels.

One should be aware, though, that just because one's driver does not request a lock, kernel code calling it may. For example, we found in a simple test with MontaVista's preemptible kernel that the functions read() and write() of a dynamically loaded driver were preempted just fine, while the functions init_module(), open() and close() were not. This means that if a low-priority process does an open() or close(), it may delay its preemption by a newly awoken high-priority process.

In practice, developers still should measure the latencies they are seeing. With the preemptible kernel approaches we see that it is still possible that a section of kernel code can hold a lock for a period longer than acceptable for one's application.

MontaVista, for example, provides a preemptible kernel, adds a few preemption points in sections where locks are held too long and provides measurement tools so that developers can measure the preemptibility performance with their actual applications and environment.

The goal of SMP locks is to ensure safe re-entrance into the kernel. That is, if processes running in parallel require kernel resources, access to these resources is done safely. The smaller the granularity of the locking, the greater the chance that competing processes can continue to execute in parallel. Parallelization is improved as the blocking (because of contention) is reduced.

This concept applies to uniprocessors as well, when I/O is considered. If one considers I/O devices as separate processors, then parallelization, or throughput, improves as applications and I/O activities can continue in parallel. Improvements in preemptibility, which imply that high-priority I/O-bound processes wake up more quickly, can thus improve throughput. Thus, somewhat paradoxically, we see that even though we may experience more context swaps and execute more code in critical kernel paths, we may still see greater system throughput.

The benefits of a preemptible kernel seem to be so clear that we can expect preemptibility eventually to be a standard feature of the Linux kernel. Preemptible kernels have been shown to reduce latencies to just a few milliseconds for some implementations and to as low as tens of microseconds in others.

In a quick survey of embedded Linux vendors, MontaVista and TimeSys provide preemptible kernels, REDSonic has preemption points, LynuxWorks and Red Hat use RTLinux. Lineo uses RTAI. OnCore provides Linux preemptibility both through a Linux system call-compatible API (as does LynuxWorks with LynxOS) and through running a Linux kernel (which effectively becomes preemptible) on top of their preemptible microkernel.

Preemption Points

Preemption points are essentially calls to the scheduler to check to see if a higher-priority task is ready and should be run. Molnar and Morton timed paths in the kernel and found sections that were quite long and inserted the schedule check calls. You can readily find these places by examining the patches, or by applying the patches and comparing the before-and-after versions of the affected source files. Preemption patches look like if (current ->need_resched) schedule();.

To use Andrew Morton's preemption point kernel patch, download the patch from the URL above and download the appropriate Linux kernel version from http://www.kernel.org/pub/linux/kernel/. Apply the patch and rebuild the kernel as usual. More details can be found at http://www.linuxdj.com/audio/lad/contrib/2.4-install-notes.html, although the notes are for an old 2.4 kernel. Also, take note that you may need to update your development environment.

To use Molnar's patches you do the same thing. Download the patch and create a new kernel. Morton has patches for many 2.4 kernels. Molnar has patches for some 2.2 kernels and some early 2.4 kernels.

Preemptible Kernels

Preemptible kernels provide for one user process to be preempted in the midst of a system call so that a newly awoken higher-priority process can run. This preemption cannot be done safely at arbitrary places in the kernel code. One section of code where this may not be safe is within a critical section. A critical section is a code sequence that must not be executed by more than one process at the same time. In the Linux kernel these sections are protected by spin locks.

MontaVista and TimeSys have taken similar approaches to creating a preemptible kernel. They cleverly alter the spin-lock calls to additionally prevent preemption. In this way, preemption is permitted in other sections. When a higher-priority process awakens, the scheduler will preempt a lower-priority process in a system call if the system call code has not indicated, via the modified spin-lock code, that preemption is not possible.

In addition, with a preemptible kernel, breaking locks to allow rescheduling is simpler than with the preemption (low-latency) patches. If the kernel releases a lock and then re-acquires it, when the lock is released preemption will be checked for. There are places in the kernel where a lock is held, say in a loop, where it need not be held the entire time. Perhaps for each iteration it can be released and then re-acquired.

MontaVista implements preemption through a counter. When a spin lock is acquired, the counter is incremented. When a high-priority process awakens, the scheduler checks to see whether the preemption counter indicates, by having the value zero, that preemption is allowed. By employing a counter, the mechanism works when locks are nested. With this mechanism, however, any spin-lock-held critical section prevents preemption, even if the lock is for an unrelated resource.

TimeSys employs a priority inheritance mutex. With this mechanism, a high-priority process can preempt a low-priority process that holds a mutex for a different resource. In addition, since they employ priority inheritance, low-priority processes holding a mutex cannot indefinitely postpone a higher-priority process waiting on the mutex. This solves the so-called Priority Inversion Problem.

One can obtain the preemption patches developed by MontaVista from the SourceForge web site, http://sourceforge.net/projects/kpreempt/. MontaVista is conducting this work in a laudable, open-source manner. They also provide their work on a real-time scheduler and high-resolution timers on SourceForge at http://sourceforge.net/projects/rtsched/ and http://sourceforge.net/projects/high-res-timers/.

The SourceForge kpreempt Project also gives a link to Robert Love's preemptible kernel work [see the April and May 2002 issues of Linux Journal for more information on Love's kernel work]. These are MontaVista's patches and are now maintained by Love, although MontaVista is still involved. The newest patches are available at ftp://ftp.kernel.org/pub/linux/kernel/people/rml/preempt-kernel/.

A recent release of Love's work was created to work with a recent constant time scheduler patch by Ingo Molnar. Molnar's O(1) scheduler is available as a patch for 2.4 and has been merged into 2.5. TimeSys makes their preemptible kernel available on their web site, http://www.timesys.com/. The preemptible kernel is provided already patched. To obtain the patches, you need to back them out with diff from a 2.4.7 kernel source tree. Their preemptible kernel source is released under the GPL.

TimeSys additionally has a number of other valuable capabilities for real-time developers that are not available for free download. These include technology for real-time scheduling and resource allocation. These modules add additional system calls, for example, to provide for conventional access to their enhancements.

For those interested in examining the gory details we provide a couple of hints on where to look. The key to the spin-lock mechanism is the include file, spinlock.h, under include/linux. Both MontaVista and TimeSys modify this file.

Interestingly, both seem to rename, and continue to use, the old functions. The original spin-lock functions still are required. It is not acceptable, for example, to preempt the kernel while it is in the scheduler. Infinite recursion would ensue. MontaVista uses names like _raw_spin_lock and _raw_read_lock; TimeSys uses names like old_spin_lock and old_spin_lock_irq.

By examining the file kernel/include/linux/mutex.h in the TimeSys distribution you can see that spin locks have been defined to use write_lock() and read_lock() functions that implement mutex locks. The file kernel/kernel/mutex.c includes the source to the do_write_lock() function, for example, which implements the mutex locking functionality.

Other Real-Time Kernel Efforts

Another popular area for improvement is in the granularity of timing. TimeSys, MontaVista, REDSonic and others have solutions that greatly improve time resolution. For example, TimeSys queries the Pentium Time Stamp Counter on context switches to insure quite accurate CPU time accounting for use of such functions as getrusage().

In the opinion of many developers, including this author, Linux's lack of the complete set of POSIX 1003.1b functionality is a significant shortcoming. Luckily, there are solutions. In particular, TimeSys has quite a good implementation.

In addition to their POSIX contributions, TimeSys has developed some innovative resource control mechanisms. These allow a real-time application to reserve CPU time or network bandwidth, for example. This, coupled with their interrupt threading model, preemptible kernel and other features, provide two or three orders of magnitude of improvement in terms of latency over a standard Linux kernel.

To date, it appears that little has been done to allow a user-space application to register a function to be called as an interrupt handler. This mechanism is called user-space interrupt handling and is available, for example, in IRIX, SGI's UNIX.

Interestingly, SGI, in Linux, provides user-space access to the interrupts from a real-time clock in their ds1286 real-time clock interface. This can be obtained from http://www.linuxhq.com/kernel/v2.4/patch/patch-2.4.10-pre1/linux/drivers/sgi/char/ds1286.c.html.

Related to user-level interrupt handling is user-space DMA to and from devices. There is a patch to provide that functionality at http://linux-patches.rock-projects.com/v2.2-f/userdma-2.2.5.html.

Guarantees

Apparently, no real-time Linux vendor is willing to make a guarantee for latency. A guarantee, if given, may have the form of something like

With our Linux kernel, and these hardware requirements, and these drivers, etc., we guarantee that your application, if it is locked in memory, has the highest priority...and will be awoken within N microseconds after your real-time device raises a hardware interrupt. If you are not able to achieve the guarantee, then we will treat it as a bug.

Since we see no such guarantees, what can we infer? We can think of several possibilities.

Vendors do not see any benefit in making the guarantee. No customers request it. In our opinion, many developers want a guarantee. In fact, hard real time implies a guarantee.

Vendors have not sufficiently measured their kernels and environment to be able to give a guarantee. This is a bit tricky. Measuring alone can't prove a guarantee can be satisfied. One must determine that the code is bounded in every circumstance and that all worst-case paths are measured. From the vendors' announcements, it is apparent that many of them have spent quite a lot of effort both measuring and studying the code. In fact, it is likely that many engineers feel rather confident that they could guarantee a certain number given the right environment.

Linux is too diverse to allow for any meaningful guarantees. This is likely the heart of the issue. Developers want to be able to modify their kernel. They want to be able to download drivers and use them. Activities such as these are beyond the control of a vendor. If a vendor were to claim a guarantee publicly, it may have to be for a system so constrained as to be useful for only one or a few select situations.

Perhaps we'll see some kind of compromise guarantee, something like ``100ms or less on Pentium class computers for properly behaving applications'' plus the time spent in drivers. The driver caveat is important, for example, because the interrupt handling code is probably in the driver and thus a major part of the latency path.

What's Next?

In the third article of our series we will discuss real-time functionality available through means outside of a Linux kernel. We will consider approaches such as RTLinux and RTAI. We also will return to benchmarking and comparing the wide variety of options.

Real-Time and Linux, Part 2: the Preemptible Kernel

Kevin Dankwardt is founder and CEO of K Computing, a training and consulting firm in Silicon Valley. In particular, his organization develops and delivers embedded and real-time Linux training worldwide.

Load Disqus comments