Kernel Korner - Using RCU in the Linux 2.5 Kernel

Read-copy update, a synchronization technique optimized for read-mostly data structures, is new with the 2.5/2.6 kernel and promises better SMP scalability.
RCU Synchronizing with NMIs

Retrofitting existing code with RCU as shown above can produce significant performance gains. The best results, of course, are obtained by designing RCU into the algorithms and code from the start.

The i386 oprofile code contains an excellent example of designed-in RCU. This code can use NMIs (nonmaskable interrupts) to handle profiling independently of the normal clock interrupt, which permits profiling of the clock interrupt handler. Synchronizing with NMIs traditionally has been difficult; by definition, no way exists to block an NMI. Straightforward locking designs therefore are subject to deadlock, where the CPU holding the lock receives an NMI, and the NMI handler spins forever on this same lock. Another approach is to mask NMIs in software using things like spin_trylock(). This method, however, produces cache bouncing and memory-barrier overhead, and the NMIs thus masked are lost. The solution in nmi_timer_int.c is as shown in Listing 4.

The synchronize_kernel() ensures that any NMI handlers executing the old NMI callback upon entry to timer_stop() have completed before timer_stop() returns. The code for oprofile_stop() and oprofile_shutdown() shown in Listing 5 illustrates why this is important. Notice that oprofile_ops->stop() invokes timer_stop(). Therefore, if oprofile_stop() and oprofile_shutdown() were called in quick succession, the newly freed CPU buffers could be accessed by an ongoing NMI. This action could surprise any code quickly reallocating this memory.

Use of RCU eliminates this race naturally, without incurring any locking or memory-barrier overhead.

Incremental Use of RCU

Using RCU is not an all-or-nothing affair. It may be applied incrementally to particular code paths as needed. A good example of this is a patch coded by Dipankar Sarma that prevents ls /proc from blocking fork(). The changes are as follows:

  1. The read_lock() and read_unlock() of tasklist_lock in get_pid_list() are replaced by rcu_read_lock() and rcu_read_unlock(), respectively.

  2. A struct rcu_head is added to task_struct in order to track the task structures waiting for a grace period to expire.

  3. The put_task_struct() macro invokes __put_task_struct() via call_rcu() rather than directly. This ensures that all concurrently executing get_pid_list() invocations complete before any task structures that they might have been referencing are freed.

  4. The SET_LINKS() and REMOVE_LINKS() macros use the _rcu form of the list-manipulation primitives.

  5. The for_each_process() macro gets a read_barrier_depends() to make this code safe for the DEC Alpha.

The problem is get_pid_list() traverses the entire tasklist in order to build the PID list needed by ls /proc. It read-holds tasklist_lock during this traversal and blocks updates to the tasklist, such as those performed by fork(). On machines with large numbers of tasks, this can cause severe difficulties, particularly given multiple instances of certain performance-monitoring tools.

Dipankar's modifications are shown in Table 4, changing only two files, adding 13 lines and deleting seven for a six-line net addition to the kernel. This patch does delete a pair of tasklist_lock uses, but none of the other 249 uses of tasklist_lock are modified. This example demonstrates use of RCU for a late-in-cycle optimization.



Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Can you explain a little more about smp_wmb?

Anonymous's picture

It seems that smp memory barrier is tightly linked with RCU. Codes taking advantage of RCU also use smp_wmb to deal with weak memory-consistency processes, for example, in the routing cache. So, I think understanding this thing is key to the "USEING" of RCU. Thanks.

Re: Can you explain a little more about smp_wmb?

Anonymous's picture

When you are using the linux/list.h _rcu macros along with normal locking to protect against concurrent updates, the memory barriers are taken care of for you. Many of the places where one can use RCU do involve linked lists, so this works much of the time. However, if you are in a situation where you cannnot use these macros, then you are quite right that an understanding of memory barriers is required.

Modern CPUs are within their rights to reorder operations unless explicitly told not to. Therefore, locking primitives on many CPUs contain memory barriers that prevent (for example) the contents of the critical section from "bleeding out" into the surrounding code. Any such "bleeding" would mean that part of the critical section was no longer protected by the lock, which would result in failure. Hence the memory barriers in locking primitives.

This situation means that lock-free operations require explicit memory barriers. A full explanation is beyond the scope of this comment, but the LKML thread starting with this message and this web page are a place to start. If you want a full treatment, Gharachorloo's Ph.D. thesis is a place to finish.