Kernel Korner - Storage Improvements for 2.6 and 2.7
Suppose that two processes on the same system mmap() the same file. Each sees a coherent view of the other's memory writes in real time. If a distributed filesystem is to provide local semantics faithfully, it needs to combine coherently the memory writes of processes mmap()ing the file from different nodes. These processes cannot have write access simultaneously to the file's pages, because there then would be no reasonable way to combine the changes.
The usual solution to this problem is to make the nodes' MMUs do the dirty work using so-called distributed shared memory. The idea is only one of the nodes allows writes at any given time. Of course, this currently means that only one node may have any sort of access to a given page of a given file at a time, because a page can be promoted from read-only to writable without the underlying filesystem having a say in the matter.
When some other node's process takes a page fault, say, at offset 0x1234 relative to the beginning of the file, it must send a message to the node that currently has the writable copy. That node must remove the page from any user processes that have it mmap()ed. In the 2.4 kernel, the distributed filesystem must reach into the bowels of the VM system to accomplish this, but the 2.6 kernel provides an API, which the second node may use as follows:
invalidate_mmap_range(inode->mapping, 0x1234, 0x4);
The contents of the page then may be shipped to the first node, which can map it into the address space of the faulting process. Readers familiar with CPU architecture should recognize the similarity of this step to cache-coherence protocols. This process is quite slow, however, as data must be moved over some sort of network in page-sized chunks. It also may need to be written to disk along the way.
Challenges remaining in the 2.6 kernel include permitting processes on multiple nodes to map efficiently a given page of a given file as read-only, which requires that the filesystem be informed of write attempts to read-only mappings. In addition, the 2.6 kernel also must permit the filesystem to determine efficiently which pages have been ejected by the VM system. This allows the distributed filesystem to do a better job of figuring out which pages to evict from memory, as evicting pages no longer mapped by any user process is a reasonable heuristic—if you efficiently can work out which pages those are.
The current implementation of NFS lockd uses a per-server lock-state database. This works quite well when exporting a local filesystem, because the locking state is maintained in RAM. However, if NFS is used to export the same distributed filesystem from two different nodes, we end up with the situation shown in Figure 2. Both nodes, running independent copies of lockd, could hand out the same lock to two different NFS clients. Needless to say, this sort of thing could reduce your application's uptime.
One straightforward way of fixing this is to have lockd acquire a lock against the underlying filesystem, permitting the distributed filesystem to arbitrate concurrent NFS lock requests correctly. However, lockd is single-threaded, so if the distributed filesystem were to block while evaluating the request from lockd, NFS locking would be stalled. And distributed filesystems plausibly might block for extended periods of time while recovering from node failures, retransmitting due to lost messages and so on.
A way to handle this is to use multithread lockd. Doing so adds complexity, though, because the different threads of lockd must coordinate in order to avoid handing out the same lock to two different clients at the same time. In addition, there is the question of how many threads should be provided.
Nonetheless, patches exist for these two approaches, and they have seen some use. Other possible approaches include using the 2.6 kernel's generic work queues instead of threads or requiring the underlying filesystem to respond immediately but permitting it to say “I don't know, but will tell you as soon as I find out”. This latter approach would allow filesystems time to sort out their locks while avoiding stalling lockd.
Some distributed filesystems use special threads whose job it is to free up memory containing cached file state no longer in use, similar to the manner in which bdflush writes out dirty blocks. Clearly, killing such a thread is somewhat counterproductive, so such threads should be exempt from the out-of-memory killer oom_kill().
The trick in the 2.6 kernel is to set the CAP_SYS_RAWIO and the CAP_SYS_ADMIN capabilities by using the following:
cap_raise(current->cap_effective,
CAP_SYS_ADMIN|CAP_SYS_RAWIO);
Here, current indicates the currently running thread. This causes oom_kill() to avoid this thread, if it does choose it, to use SIGTERM rather than SIGKILL. The thread may catch or ignore SIGTERM, in which case oom_kill() marks the thread so as to refrain from killing it again.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Designing Electronics with Linux | May 22, 2013 |
| Dynamic DNS—an Object Lesson in Problem Solving | May 21, 2013 |
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
- Designing Electronics with Linux
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Dynamic DNS—an Object Lesson in Problem Solving
- Using Salt Stack and Vagrant for Drupal Development
- Build a Skype Server for Your Home Phone System
- New Products
- Why Python?
- A Topic for Discussion - Open Source Feature-Richness?
- Validate an E-Mail Address with PHP, the Right Way
- Tech Tip: Really Simple HTTP Server with Python
Enter to Win an Adafruit Pi Cobbler Breakout Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Pi Cobbler Breakout Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- 5-21-13, Prototyping Pi Plate Kit: Philip Kirby
- Next winner announced on 5-27-13!
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?





1 hour 48 min ago
4 hours 18 min ago
14 hours 20 min ago
18 hours 47 min ago
22 hours 23 min ago
22 hours 56 min ago
1 day 1 hour ago
1 day 1 hour ago
1 day 1 hour ago
1 day 5 hours ago