Scaling dcache with RCU
A full parallelization of dcache would be quite complex and was deemed too risky for the latter part of the 2.5 effort. The 2.6 dcache is but one step along the road to RCU; the goal for 2.7 is to walk the entire path without acquiring any locks.
Listing 1. Lock-Free Pathname Segment Lookup
1 struct dentry *
2 __d_lookup(struct dentry * parent,
3 struct qstr * name)
4 {
5 unsigned int len = name->len;
6 unsigned int hash = name->hash;
7 const unsigned char *str = name->name;
8 struct hlist_head *head = d_hash(parent,hash);
9 struct dentry *found = NULL;
10 struct hlist_node *node;
11
12 rcu_read_lock();
13 hlist_for_each (node, head) {
14 struct dentry *dentry;
15 unsigned long move_count;
16 struct qstr * qstr;
17
18 smp_read_barrier_depends();
19 dentry = hlist_entry(node, struct dentry,
20 d_hash);
21 if (unlikely(dentry->d_bucket != head))
22 break;
23 move_count = dentry->d_move_count;
24 smp_rmb();
25 if (dentry->d_name.hash != hash)
26 continue;
27 if (dentry->d_parent != parent)
28 continue;
29 qstr = dentry->d_qstr;
30 smp_read_barrier_depends();
31 if (parent->d_op &&
32 parent->d_op->d_compare) {
33 if (parent->d_op->d_compare(parent, qstr,
34 name))
35 continue;
36 } else {
37 if (qstr->len != len)
38 continue;
39 if (memcmp(qstr->name, str, len))
40 continue;
41 }
42 spin_lock(&dentry->d_lock);
43 /*
44 * If dentry is moved, fail the lookup
45 */
46 if (likely(move_count ==
47 dentry->d_move_count)) {
48 if (!d_unhashed(dentry)) {
49 atomic_inc(&dentry->d_count);
50 found = dentry;
51 }
52 }
53 spin_unlock(&dentry->d_lock);
54 break;
55 }
56 rcu_read_unlock();
57 return found;
58 }
Pathname segment lookup is performed by the __d_lookup() function shown in Listing 1. The __d_lookup() function is invoked with a pointer to the parent directory's dentry and the name to be looked up. The name is passed in a struct qstr, which contains a pointer to the string, its length, a precomputed hash value for the dcache hash table and the name itself, if desired.
Lines 5–7 unmarshall the struct qstr. Line 8 hashes the combination of the name and the parent dentry pointer into the global dcache hash table, yielding a pointer to the corresponding hash chain.
Lines 12 and 56 demark the RCU-protected segment of the code, disabling preemption in CONFIG_PREEMPT kernels, as specified by the Reader-Writer-Lock/RCU analogy described in the article entitled “Using RCU in the Linux 2.5 Kernel” in the October 2003 issue of Linux Journal. Lines 13–55 loop through the elements in the selected hash chain, looking for the matching dentry. Line 18 issues a memory barrier but only on DEC Alpha. On other CPUs, the data dependency implied by the pointer dereference suffices, so on these CPUs, line 18 generates no code.
Because this lookup acquired no locks, it may be racing with a rename system call. Such a system call could move a dentry to another hash chain, taking this lookup with it. Lines 21 and 22 check for this race, but they are not sufficient in and of themselves. Therefore, line 23 takes a snapshot of the number of times that the current dentry has been subjected to a rename, the dcache d_move() function, which is used later to determine if any renames raced with the path walk. Line 24 is a memory barrier to ensure that the snapshot is not reordered by either the compiler or the CPU.
Lines 25–28 check the name hash and the parent dentry. If either fail to match, this dentry cannot be the target of our lookup. Lines 29–41 do the full name comparison, with memory barrier for DEC Alpha at line 30. Filesystem-specific name comparison functions may be provided, for example, for case-independent filesystems, as shown on line 33.
If execution reaches line 42, we have found a child dentry with a matching name. We then acquire the child dentry's lock on line 42. Because we have a lock on each dentry, the level of contention on these individual locks is much lower than on the original dcache_lock. Nonetheless, life is not perfect. For example, the lock on the root dentry still is subject to contention, a topic discussed later.
The child dentry possibly was renamed after the d_move_count snapshot was acquired on line 23. Therefore, lines 46–47 check the current value of d_move_count against the snapshot. If the check passes, the child dentry has not been renamed out from under the lookup, and lines 48–51 increment a reference count—but only if the entry still is hashed.
Line 53 releases the child dentry's lock, and line 54 breaks out of the hash-chain search loop. Line 57 returns a pointer to the child dentry if the lookup was successful, or NULL otherwise.
A failure of __d_lookup() does not mean that failure is returned to the user process. The file actually may exist, but has not been loaded yet into dcache.
This function does not protect, however, against all rename-race hazards. One additional race is caused by the fact that dcache uses hlist rather than list for the dcache hash chains. It uses hlist to save memory, because hlist requires one rather than two pointers in the list header. This does mean, though, that hlist, unlike list, is not circular. It therefore is possible that a particular dentry will be renamed such that it lands in a previously empty dcache hash chain. If this happened at the right time, the __d_lookup() function could return search failure incorrectly.
An incorrectly returned search failure is handled by the upper-level d_lookup() function, shown in Listing 2. Any racing renames are detected by the read_seqretry() function on line 13. As the problematic case results only in spurious failure, the check is made only on NULL return from __d_lookup().
Today’s modular x86 servers are compute-centric, designed as a least common denominator to support a wide range of IT workloads. Those generic, virtualized IT workloads have much different resource optimization requirements than hyperscale and cloud applications. They have resulted in a “one size fits all” enterprise IT architecture that is not optimized for a specific set of IT workloads, and especially not emerging hyperscale workloads, such as web applications, big data, and object storage. In this report, you will learn how shifting the focus from traditional compute-centric IT architectures to an innovative disaggregated fabric-based architecture can optimize and scale your data center.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
| Non-Linux FOSS: Seashore | May 10, 2013 |
| Trying to Tame the Tablet | May 08, 2013 |
| Dart: a New Web Programming Experience | May 07, 2013 |
- RSS Feeds
- New Products
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Drupal Is a Framework: Why Everyone Needs to Understand This
- A Topic for Discussion - Open Source Feature-Richness?
- Home, My Backup Data Center
- Developer Poll
- Dart: a New Web Programming Experience
- May 2013 Issue of Linux Journal: Raspberry Pi
- What's the tweeting protocol?
Enter to Win an Adafruit Prototyping Pi Plate Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Prototyping Pi Plate Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- Next winner announced on 5-21-13!
Free Webinar: Linux Backup and Recovery
Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.
In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.




33 min 7 sec ago
2 hours 6 min ago
3 hours 43 min ago
5 hours 41 min ago
5 hours 58 min ago
6 hours 28 min ago
6 hours 29 min ago
6 hours 29 min ago
9 hours 30 min ago
17 hours 56 min ago