Scaling dcache with RCU

Reorganizing the way Linux caches filename lookups is a big win for helping to scale to large servers.

RCU (read-copy update), a synchronization technique optimized for read-mostly data structures, recently was added to the Linux 2.6 kernel. This article describes how RCU improved the scalability of Linux's directory-entry cache (dcache). For some more background, see “Using RCU in the Linux 2.5 Kernel”, in the October 2003 issue of Linux Journal.

Linux Directory-Entry Cache

Linux's dcache maintains a partial in-memory image of the filesystem hierarchy. This image enables pathname lookup without expensive disk transfers, greatly increasing the performance of filesystem operations. To ease handling of mount and unmount operations, the Linux kernel also maintains an image of the mount tree in struct vfsmount structures.

If the Linux 2.4 dcache is so great, why change it? The difficulty with the 2.4 dcache is it uses the global dcache_lock. This lock is a source of cache line bouncing on small systems and a scalability bottleneck on large systems, as illustrated in Figure 1.

Figure 1. Tux Doing His Duty

Visual Overview of dcache

This section provides background for the RCU-related dcache changes, which are described later in the article. Readers desiring more detail should dive into the source code.

This section uses the example filesystem tree shown in Figure 2, which has two mounted filesystems with roots r1 and r2, respectively. The second filesystem is mounted on directory b, as indicated by the dashed arrow. The file g has not been referenced recently and therefore is not present in dcache, as indicated by its dashed blue box.

Figure 2. Example Filesystem Tree

The dcache subsystem maintains several views of the filesystem trees. Figure 3 shows the directory structure representation. Each dentry representing a directory maintains a doubly linked circular list headed by the d_subdirs field that runs through the child dentries' d_child fields. Each child's d_parent pointer references its parent. The mountpoint (dentry b) does not reference the mounted filesystem directly. Instead, the mountpoint's d_mounted flag is set, and dcache looks up the mounted filesystem in mount_hashtable, a process that is described later.

Figure 3. dcache Representation of the Example Filesystem Tree

Although one could search the d_subdirs lists directly, this would be a slow process for large directories. Instead, __d_lookup() hashes the parent directory's dentry pointer and the child's name, searching the global dentry_hashtable for the corresponding dentry. This hash table is shown in Figure 4, along with the LRU list headed by dentry_unused. Any dentry in the LRU list usually is in the hash table as well. Exceptions include cases where parent directories time out, such as in distributed filesystems like NFS.

Figure 4. dentry Hash Table

Each dentry references its inode using the d_inode pointer. This d_inode pointer can be NULL for negative dentries, which lack an inode. Negative dentries can be generated when a filesystem removes a dentry's file or when someone tries to lock a non-existent file. Negative dentries can improve system performance by failing repeated accesses to a given non-existent file without having to invoke the underlying filesystem. Similarly, hard links result in multiple dentries sharing an inode, as shown in Figure 5.

Figure 5. Hard-Link Alias Chains

Figure 6 shows a high-level dentry state diagram. The normal dentry's life goes as follows:

  1. d_alloc() allocates a new dentry for a newly referenced file, leading to state New.

  2. d_add() associates the new dentry with its name and inode, leading to state Hashed.

  3. When done with the file, d_put() adds the dentry to the LRU list and sets its DCACHE_REFERENCED bit in its d_vfs_flags field, leading to state LRU Ref (Hashed).

  4. If the file is referenced again while in the LRU Ref (Hashed) state, dget_locked(), usually called from d_lookup(), marks it in use. If it still is in use at the next prune_dcache() invocation, it is removed from the LRU list, leading again to state Hashed.

  5. Otherwise, prune_dcache() eventually removes the DCACHE_REFERENCED bit from the d_vfs_flags field, leading to state LRU (Hashed).

  6. As before, if the file is referenced again, dget_locked() marks it in use so that prune_dcache() can remove it from the LRU list, leading again to state Hashed.

  7. Otherwise, the second consecutive call to prune_dcache() invokes d_free() via prune_one_dentry(), resulting in state Dead.

Other paths through Figure 6 are possible. For example, if a distributed filesystem converts a cached file handle into a new dentry, it invokes d_alloc_anon() to allocate the dentry when the corresponding object's parent is no longer represented in the dentry cache. Similarly, using d_delete() to delete the file or directory underlying a given dentry would move that dentry to the Negative state. On last close, it would be advanced to “Dead”.

Figure 6. dentry State Diagram

Figure 7 shows the mount_hashtable data structure used to map the mountpoint dentry to the struct vfsmount of the mounted filesystem. This mapping hashes the pointer to the mountpoint dentry and the pointer to the struct vfsmount for the filesystem containing the mountpoint. This combination of dentry pointer and struct vfsmount allows multiple mounts on the same mountpoint to be handled more gracefully.

Figure 7. Traversing Mountpoints

The example filesystem layout shown in Figure 2 would result in struct vfsmount structures as shown in Figure 8. The vfs1 structure references the root dentry r1 both as the mnt_mountpoint and the mnt_root, because this filesystem is the ultimate root of the filesystem tree. The vfs2 structure references dentry b as its mnt_mountpoint and r2 as its mnt_root. Thus, when the mount_hashtable lookup returns a pointer to vfs2, the mnt_root field quickly locates the root of the mounted filesystem.

Figure 8. VFS Mount Tree

The overall shape of the mounted filesystems is reflected in the mnt_mount/mnt_child lists. These lists are used by functions such as copy_tree() while doing loopback mount, which need to traverse all the filesystems mounted in a particular subtree of the overall pathname namespace.