Kernel Korner - Using RCU in the Linux 2.5 Kernel

Read-copy update, a synchronization technique optimized for read-mostly data structures, is new with the 2.5/2.6 kernel and promises better SMP scalability.
Lesson: Avoid Expensive Operations

If you care about performance, you want to avoid these expensive operations. Avoiding them is precisely what RCU does, at least for read-only accesses to read-mostly data structures, although the DEC Alpha still requires some read-side memory barriers. As seen in Figure 4, RCU scales well and has good single-CPU performance for the hash-table-search benchmarklet.

Figure 4. RCU Read Performance by Number of Processors

Of course, updates do slow down RCU, as shown in Figure 5. This graph illustrates the relative performance of these synchronization primitives as the workload varies from read-only (left-hand side) to write-only (right-hand side). RCU is better than brlock across the board. In fact, RCU has replaced brlock in the 2.5 kernel, thanks to Steve Hemminger of OSDL and a number of Linux's networking luminaries. RCU is the best option overall as long as fewer than about one-third of the accesses are updates. Again, your mileage will vary depending on your workload and hardware. In particular, workloads with greater per-element local processing—for example, more complex comparisons—would scale better. As always, use the right tool for the job.

Figure 5. RCU Performance by Fraction of Accesses That Are Updates

How Does RCU Work?

If reading CPUs never make their presence known, how can updating CPUs avoid messing up readers? With locks, the updating CPU examines the lock state to determine when it is safe to carry out the update. With RCU, the updating CPU must make this determination indirectly.

The trick is RCUs reading CPUs are not permitted to block while traversing the data structure, the same as when CPUs holding a spinlock or rwlock are not permitted to block. This means that once an element is unlinked from a list, any CPU that subsequently performs a context switch cannot possibly gain a reference to this element. Context switch is a quiescent state: CPUs undergoing context switches cannot hold references to RCU-protected data structures. Any time period during which all CPUs pass through a quiescent state is a grace period. A CPU may therefore free up an element after a grace period has elapsed from the time that it unlinked the element from the list.

Thus, a simple, though inefficient, RCU-based deletion algorithm could perform the following steps in a non-preemptive Linux kernel:

  • Unlink element B from the list, but do not free it—the state of the list as shown in Step 2 of Figure 6.

  • Run on each CPU in turn. At this point, each CPU has performed one context switch after element B has been unlinked. Thus, there cannot be any more references to element B, as shown in Step 3 (Figure 6).

  • Free up element B, as shown in Step 4 (Figure 6).

Figure 6. Steps of a Simple RCU-Based Deletion Algorithm

Andrea Arcangeli created a more efficient algorithm that boasts extremely short grace periods, which was the first Linux RCU implementation shipped. Dipankar Sarma coded up an even more efficient RCU implementation that maintains callback cache locality and permits a grace period to service any number of concurrent updates. Dipankar's algorithm is included in the 2.5 kernel and was described in detail at the Ottawa Linux Symposium in 2002.


Listing 1 shows the external API for RCU. The synchronize_kernel() function blocks for a full grace period. This is an easy-to-use function, but it incurs expensive context-switch overhead. It also cannot be called with locks held or from interrupt context. However, it does allow concurrent callers to share a grace period.

In contrast, call_rcu() schedules a callback function func(arg) after a full grace period. Because call_rcu() never sleeps, it may be called with locks held and from interrupt context. The call_rcu() function uses its struct rcu_head argument to remember its callback function and argument during the grace period. An rcu_head is often placed within the structure being protected by RCU, eliminating the need to allocate it separately.

The primitives rcu_read_lock() and rcu_read_unlock() demark a read-side RCU critical section but generate no code in non-preemptive kernels. In preemptive kernels, they disable preemption, thereby preventing preemption from prematurely ending a grace period. RCU-like mechanisms also may be used in preemptive kernels without suppressing preemption, as demonstrated by the K42 research OS.

Most modern microprocessors feature weak memory-consistency models, which require special memory-barrier instructions. However, these instructions are difficult to understand and even more difficult to test, so the Linux list-manipulation API is extended to include memory barriers, as suggested by Manfred Spraul and as shown in Listing 2. The hlist primitives recently were added by Andi Kleen in order to reduce memory requirements for large hash tables.



Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Can you explain a little more about smp_wmb?

Anonymous's picture

It seems that smp memory barrier is tightly linked with RCU. Codes taking advantage of RCU also use smp_wmb to deal with weak memory-consistency processes, for example, in the routing cache. So, I think understanding this thing is key to the "USEING" of RCU. Thanks.

Re: Can you explain a little more about smp_wmb?

Anonymous's picture

When you are using the linux/list.h _rcu macros along with normal locking to protect against concurrent updates, the memory barriers are taken care of for you. Many of the places where one can use RCU do involve linked lists, so this works much of the time. However, if you are in a situation where you cannnot use these macros, then you are quite right that an understanding of memory barriers is required.

Modern CPUs are within their rights to reorder operations unless explicitly told not to. Therefore, locking primitives on many CPUs contain memory barriers that prevent (for example) the contents of the critical section from "bleeding out" into the surrounding code. Any such "bleeding" would mean that part of the critical section was no longer protected by the lock, which would result in failure. Hence the memory barriers in locking primitives.

This situation means that lock-free operations require explicit memory barriers. A full explanation is beyond the scope of this comment, but the LKML thread starting with this message and this web page are a place to start. If you want a full treatment, Gharachorloo's Ph.D. thesis is a place to finish.