Kernel Korner - Storage Improvements for 2.6 and 2.7

The Linux 2.6 kernel has improved Linux's storage capabilities with advances such as the anticipatory I/O scheduler and support for storage arrays and distributed filesystems.

Invalidating Pages

Suppose that two processes on the same system mmap() the same file. Each sees a coherent view of the other's memory writes in real time. If a distributed filesystem is to provide local semantics faithfully, it needs to combine coherently the memory writes of processes mmap()ing the file from different nodes. These processes cannot have write access simultaneously to the file's pages, because there then would be no reasonable way to combine the changes.

The usual solution to this problem is to make the nodes' MMUs do the dirty work using so-called distributed shared memory. The idea is that only one of the nodes allows writes at any given time. Of course, this currently means that only one node may have any sort of access to a given page of a given file at a time, because a page can be promoted from read-only to writable without the underlying filesystem having a say in the matter.

When some other node's process takes a page fault, say, at offset 0x1234 relative to the beginning of the file, it must send a message to the node that currently has the writable copy. That node must remove the page from any user processes that have it mmap()ed. In the 2.4 kernel, the distributed filesystem must reach into the bowels of the VM system to accomplish this, but the 2.6 kernel provides an API, which the second node (the one holding the writable copy) may use as follows:

invalidate_mmap_range(inode->mapping, 0x1234, 0x4);

The contents of the page then may be shipped to the first node, which can map it into the address space of the faulting process. Readers familiar with CPU architecture should recognize the similarity of this step to cache-coherence protocols. This process is quite slow, however, as data must be moved over some sort of network in page-sized chunks. It also may need to be written to disk along the way.
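The hand-off described above can be modeled in user space. The following is a minimal sketch, not kernel code: the names node_page, surrender_page(), fault_in_page() and access_page() are hypothetical, a 16-byte buffer stands in for a page and a memcpy() stands in for the network. In a real filesystem, surrender_page() is roughly where the owning node would call the invalidation API shown earlier.

```c
#include <assert.h>
#include <string.h>

#define NODES 2
#define PAGE_BYTES 16           /* tiny stand-in for a real page */

/* Hypothetical per-node view of one page of one file. */
struct node_page {
    int mapped;                 /* nonzero while this node maps the page */
    char data[PAGE_BYTES];
};

static struct node_page nodes[NODES];
static int owner;               /* node currently allowed to access the page */

/* Owner side: unmap the page locally, then ship its contents. */
static void surrender_page(int from, char *wire)
{
    nodes[from].mapped = 0;     /* a real filesystem would invalidate here */
    memcpy(wire, nodes[from].data, PAGE_BYTES);
}

/* Faulting side: fetch the page from the current owner if necessary. */
static void fault_in_page(int node)
{
    if (owner != node) {
        char wire[PAGE_BYTES];  /* plays the role of the network message */

        surrender_page(owner, wire);
        memcpy(nodes[node].data, wire, PAGE_BYTES);
        owner = node;
    }
    nodes[node].mapped = 1;
}

/* Any access to an unmapped page faults first, as the MMU would force. */
static char *access_page(int node)
{
    if (!nodes[node].mapped)
        fault_in_page(node);
    return nodes[node].data;
}

/* Walk through two hand-offs and check the model at each step. */
static void dsm_demo(void)
{
    strcpy(access_page(0), "hello");                /* node 0 writes */
    assert(strcmp(access_page(1), "hello") == 0);   /* node 1 faults it in */
    assert(!nodes[0].mapped && owner == 1);         /* node 0 lost its mapping */
    strcpy(access_page(1), "world");                /* node 1 writes */
    assert(strcmp(access_page(0), "world") == 0);   /* node 0 faults it back */
}
```

Calling dsm_demo() from a main() function exercises the model. Every change of ownership costs one page-sized copy, which is the slowness noted above.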

Challenges remaining in the 2.6 kernel include permitting processes on multiple nodes to map efficiently a given page of a given file as read-only, which requires that the filesystem be informed of write attempts to read-only mappings. In addition, the kernel must permit the filesystem to determine efficiently which pages have been ejected by the VM system. This allows the distributed filesystem to do a better job of figuring out which pages to evict from memory, as evicting pages no longer mapped by any user process is a reasonable heuristic, provided you can work out efficiently which pages those are.

NFS Lock Requests

The current implementation of NFS lockd uses a per-server lock-state database. This works quite well when exporting a local filesystem, because the locking state is maintained in RAM. However, if NFS is used to export the same distributed filesystem from two different nodes, we end up with the situation shown in Figure 2. Both nodes, running independent copies of lockd, could hand out the same lock to two different NFS clients. Needless to say, this sort of thing could reduce your application's uptime.

Figure 2. One lock, two clients, big trouble.

One straightforward way of fixing this is to have lockd acquire a lock against the underlying filesystem, permitting the distributed filesystem to arbitrate concurrent NFS lock requests correctly. However, lockd is single-threaded, so if the distributed filesystem were to block while evaluating the request from lockd, NFS locking would be stalled. And distributed filesystems plausibly might block for extended periods of time while recovering from node failures, retransmitting due to lost messages and so on.
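The difference between per-server lock state and arbitration by the underlying filesystem can be illustrated with a toy model. This is a sketch under stated assumptions rather than lockd's actual code: lock_table, lockd_server, lockd_request() and count_grants() are hypothetical names, locks are small integers and client IDs are nonzero integers.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCKS 8

/* Hypothetical lock-state database: holder[i] is 0 when lock i is free. */
struct lock_table {
    int holder[MAX_LOCKS];
};

/* Grant lock_id to client unless some other client already holds it. */
static int table_try_lock(struct lock_table *t, int lock_id, int client)
{
    assert(t != NULL);
    if (lock_id < 0 || lock_id >= MAX_LOCKS || client == 0)
        return 0;
    if (t->holder[lock_id] != 0 && t->holder[lock_id] != client)
        return 0;
    t->holder[lock_id] = client;
    return 1;
}

/* One NFS server node; fs is NULL when lockd consults only local state. */
struct lockd_server {
    struct lock_table local;
    struct lock_table *fs;      /* lock state shared through the filesystem */
};

/* Ask the filesystem's table when there is one, else the local table. */
static int lockd_request(struct lockd_server *s, int lock_id, int client)
{
    struct lock_table *authority = s->fs ? s->fs : &s->local;

    return table_try_lock(authority, lock_id, client);
}

/* Two servers, two clients, one lock: count how many requests succeed. */
static int count_grants(int use_fs_arbiter)
{
    struct lock_table shared = { {0} };
    struct lockd_server a = { { {0} }, NULL };
    struct lockd_server b = { { {0} }, NULL };

    if (use_fs_arbiter)
        a.fs = b.fs = &shared;
    return lockd_request(&a, 5, 1) + lockd_request(&b, 5, 2);
}
```

In this model, count_grants(0) reproduces the double grant of Figure 2, and count_grants(1) lets only one client win. The sketch ignores the blocking problem: a real distributed filesystem might take a long time to answer table_try_lock().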

One way to handle this is to multithread lockd. Doing so adds complexity, though, because the different threads of lockd must coordinate in order to avoid handing out the same lock to two different clients at the same time. In addition, there is the question of how many threads should be provided.

Nonetheless, patches exist for these two approaches, and they have seen some use. Other possible approaches include using the 2.6 kernel's generic work queues instead of threads or requiring the underlying filesystem to respond immediately but permitting it to say “I don't know, but will tell you as soon as I find out”. This latter approach would allow filesystems time to sort out their locks while avoiding stalling lockd.
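The "I don't know, but will tell you as soon as I find out" approach can be sketched as a non-blocking call that returns a third status and later delivers the answer through a callback. Everything here is an assumption made for illustration: the lock_status values, fs_try_lock(), fs_recovery_done() and lockd_reply() are hypothetical names, not an existing kernel interface.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCKS   8
#define MAX_PENDING 8
#define MAX_REPLIES 16

enum lock_status { LOCK_GRANTED, LOCK_DENIED, LOCK_PENDING };

/* Called by the filesystem once a deferred request has been decided. */
typedef void (*lock_done_fn)(int request_id, enum lock_status result);

struct pending_request {
    int in_use, request_id, lock_id, client;
    lock_done_fn done;
};

static int lock_holder[MAX_LOCKS];      /* 0 = free, else holding client */
static int recovering;                  /* nonzero while the fs cannot decide */
static struct pending_request pending[MAX_PENDING];

static enum lock_status decide(int lock_id, int client)
{
    if (lock_holder[lock_id] != 0 && lock_holder[lock_id] != client)
        return LOCK_DENIED;
    lock_holder[lock_id] = client;
    return LOCK_GRANTED;
}

/* Never blocks: answers now, or queues the request and says "pending". */
static enum lock_status fs_try_lock(int request_id, int lock_id, int client,
                                    lock_done_fn done)
{
    int i;

    assert(lock_id >= 0 && lock_id < MAX_LOCKS && client != 0);
    if (!recovering)
        return decide(lock_id, client);
    for (i = 0; i < MAX_PENDING; i++) {
        if (!pending[i].in_use) {
            pending[i] = (struct pending_request){
                1, request_id, lock_id, client, done };
            return LOCK_PENDING;
        }
    }
    return LOCK_DENIED;                 /* queue full: refuse, never block */
}

/* Recovery finished: decide queued requests in slot order. */
static void fs_recovery_done(void)
{
    int i;

    recovering = 0;
    for (i = 0; i < MAX_PENDING; i++) {
        if (pending[i].in_use) {
            pending[i].in_use = 0;
            pending[i].done(pending[i].request_id,
                            decide(pending[i].lock_id, pending[i].client));
        }
    }
}

/* Example lockd-side callback: remember the answer for each request. */
static enum lock_status reply[MAX_REPLIES];

static void lockd_reply(int request_id, enum lock_status result)
{
    assert(request_id >= 0 && request_id < MAX_REPLIES);
    reply[request_id] = result;
}
```

With recovering set, two competing calls to fs_try_lock() for the same lock both return LOCK_PENDING at once. After fs_recovery_done() runs, lockd_reply() has recorded one grant and one denial. The single lockd thread is free to serve other requests in the meantime.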

Don't Kill the Garbage Collector

Some distributed filesystems use special threads whose job it is to free up memory containing cached file state no longer in use, similar to the manner in which bdflush writes out dirty blocks. Clearly, killing such a thread is somewhat counterproductive, so such threads should be exempt from the out-of-memory killer oom_kill().

The trick in the 2.6 kernel is to set the CAP_SYS_RAWIO and the CAP_SYS_ADMIN capabilities by using the following:


Here, current indicates the currently running thread. This causes oom_kill() to avoid this thread and, if it does choose it, to use SIGTERM rather than SIGKILL. The thread may catch or ignore SIGTERM, in which case oom_kill() marks the thread so as to refrain from killing it again.