Writing Stackable Filesystems
Writing filesystems, or any kernel code, is hard. The kernel is a complex environment to master, and small mistakes can cause severe data corruption. Filesystems, however, offer a clean data access mechanism that is transparent to user applications, which is why developers always desire to add new features to filesystems. In this article, we provide a quick introduction so you can add new functionality to existing filesystems without having to become a kernel or filesystems expert.
Although Linux supports many filesystems, they are pretty similar: disk-based filesystems, network-based filesystems, etc. Making a filesystem stable and efficient takes years of effort, and once it's stable and working, you don't want to break it by throwing in new features. Besides, maintainers of filesystems rarely accept feature-enhancement patches to their stable filesystems. So, it is no surprise that the most popular filesystems currently in use have not fundamentally changed in years.
Suppose you want to write a simple encryption filesystem that uses a single fixed cipher key to encrypt file data. Getting portable C code for various ciphers is easy. Next, you have to tie the calls to encrypt and decrypt data buffers into the filesystem. Conceptually the problem is simple: encrypt any data that comes from the write system call before it is written to disk, and decrypt any data that comes from the disk before it is passed back to the user process that called the read system call.
Your first inclination might be to copy the 5,000+ lines of source code for ext2, study it and then add your cipher calls to it. You should resist the urge to copy a whole other filesystem as a starting point. Although it's only 5,000+ lines of code, kernel code is at least an order of magnitude more complex to develop than user-level code. If you actually end up putting the calls to your cipher in the right place in this new filesystem, you'll find you spent most of your time studying it, only to add a small number of lines in some places. Even so, now you've got yourself a single encrypting ext2 filesystem. What if you want an encrypting NFS filesystem or any one of the plethora of other Linux filesystems?
Linux, like most OSes, separates its filesystem code into two components: native filesystems (ext2, NFS, etc.) and a general-purpose layer called the virtual filesystem (VFS). The VFS is a layer that sits between system call entry points and native filesystems. The VFS provides a uniform access mechanism to filesystems without needing to know the details of those filesystems. When filesystems are initialized in the kernel, they install a set of function pointers (methods in OO-speak) for the VFS to use. The VFS, in turn, calls these pointer functions generically, without knowing which specific filesystem the pointers represent. For example, an unlink system call gets translated into a service routine sys_unlink, which invokes the vfs_unlink VFS function, which invokes a filesystem-specific method by using its installed function pointer: ext2_unlink for ext2, nfs_unlink for NFS or the appropriate function for other filesystems. Throughout this article, we refer to the specific filesystem method using ->, as in ->unlink().
To solve this problem of how to develop our encryption filesystem quickly, we employ the following adage: “Any problem in computer science can be solved by adding another level of indirection.” Luckily, the Linux VFS allows another filesystem to be inserted right between the VFS and another filesystem. Figure 1 shows such a stackable encryption filesystem called Cryptfs. Cryptfs is called stackable because it stacks on top of another filesystem (ext2). Here, the VFS calls Cryptfs' ->write() method (cryptfs_write); Cryptfs encrypts the user data it receives and passes it down by calling the ->write() method below (ext2_write).

Figure 1. An Example Stackable Encryption Filesystem
In general, stackable filesystems can stand alone and be mounted on top of any other existing filesystem mountpoint; this means you only have to develop your (stackable) filesystem once, and it will work with any other native (low-level) filesystem such as ext2, NFS, etc. Moreover, as of Linux 2.4.20, stackable filesystems even can be exported safely (via nfs-utils-1.0 or newer) to remote NFS clients.
The basic function of a stackable filesystem is to pass an operation and its arguments to the lower-level filesystem. The following distilled code snippet shows how a stackable null-mode pass-through filesystem called Wrapfs handles the ->unlink() operation:
int wrapfs_unlink(struct inode *dir,
struct dentry *dentry)
{
int err = 0;
struct inode *lower_dir;
struct dentry *lower_dentry;
lower_dir = get_lower_inode(dir);
lower_dentry = get_lower_dentry(dentry);
/* pre-call code can go here */
err = lower_dir->i_op->unlink(lower_dir,
lower_dentry);
/* post-call code can go here */
return err;
}
When the VFS needs to unlink a file in a Wrapfs filesystem, it calls wrapfs_unlink, passing it the inode of the directory in which the file to remove resides (dir) and the name of the entry to remove (encapsulated in dentry).
Every filesystem keeps a set of objects that belong to it, including inodes, directory entries and open files. When using stacking, multiple objects represent the same file—only at different layers. For example, our Cryptfs in Figure 1 may have to keep a directory entry (dentry) object with the clear-text version of the filename, while ext2 will keep another dentry with the ciphertext (encrypted) version of the same name. To be truly transparent to the VFS and other filesystems, stackable filesystems keep multiple objects at each level.
This is why the first few actions that wrapfs_unlink takes are to locate, from the arguments it gets, the inode and dentry that correspond to the same objects, only at the filesystem mounted below. These get_lower_* functions essentially follow pointers that previously were stored in the private fields of Wrapfs' objects when those objects were created. Once the lower objects are located, the main magic of stacking takes place. We call the lower-level filesystem's own ->unlink() method, through the lower-level directory inode, and pass it the two lower objects.
Wrapfs is a full-fledged stackable null-layer (or loopback) filesystem that simply passes all operations and objects (unmodified) between the VFS and the lower filesystem. Wrapfs itself, however, is not easy to write for one main reason; it has to treat the lower filesystem as if it were the VFS, while appearing to the real Linux VFS as a lower-level filesystem. This dual role requires careful handling of locks, reference counts, allocated memory and so on. Luckily, someone already wrote and maintains Wrapfs. Therefore, Wrapfs serves as an excellent template for you to modify and add new functionality.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?
| Designing Electronics with Linux | May 22, 2013 |
| Dynamic DNS—an Object Lesson in Problem Solving | May 21, 2013 |
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
- Designing Electronics with Linux
- New Products
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Dynamic DNS—an Object Lesson in Problem Solving
- Using Salt Stack and Vagrant for Drupal Development
- Validate an E-Mail Address with PHP, the Right Way
- Tech Tip: Really Simple HTTP Server with Python
- Build a Skype Server for Your Home Phone System
- Why Python?
- A Topic for Discussion - Open Source Feature-Richness?
- Reply to comment | Linux Journal
38 min 31 sec ago - Not free anymore
4 hours 40 min ago - Great
8 hours 27 min ago - Reply to comment | Linux Journal
8 hours 35 min ago - Understanding the Linux Kernel
10 hours 50 min ago - General
13 hours 20 min ago - Kernel Problem
23 hours 22 min ago - BASH script to log IPs on public web server
1 day 3 hours ago - DynDNS
1 day 7 hours ago - Reply to comment | Linux Journal
1 day 7 hours ago




Comments
Nicely demonstrated
Nicely demonstrated stackable file systems.
However, in real applications it is hard to keep the two layers (crypt fs and underlying low level file system) separate.
Re: Kernel Korner: Writing Stackable Filesystems
Really nice article, although would be thrilled to see a more followup of the same!
amazing article...
I must congratulate you for an aticle that is simple and addresses the core of the issues relating to stacking..
keep posting new articles ..
Pradeep