Writing Stackable Filesystems
Writing filesystems, or any kernel code, is hard. The kernel is a complex environment to master, and small mistakes can cause severe data corruption. Filesystems, however, offer a clean data access mechanism that is transparent to user applications, which is why developers often want to add new features to them. In this article, we provide a quick introduction to stackable filesystems, so you can add new functionality to existing filesystems without having to become a kernel or filesystems expert.
Although Linux supports many filesystems, most of them fall into a few similar categories, such as disk-based filesystems and network-based filesystems. Making a filesystem stable and efficient takes years of effort, and once it's stable and working, you don't want to break it by throwing in new features. Besides, maintainers of filesystems rarely accept feature-enhancement patches to their stable filesystems. So, it is no surprise that the most popular filesystems currently in use have not fundamentally changed in years.
Suppose you want to write a simple encryption filesystem that uses a single fixed cipher key to encrypt file data. Getting portable C code for various ciphers is easy. Next, you have to tie the calls to encrypt and decrypt data buffers into the filesystem. Conceptually the problem is simple: encrypt any data that comes from the write system call before it is written to disk, and decrypt any data that comes from the disk before it is passed back to the user process that called the read system call.
Your first inclination might be to copy the 5,000+ lines of source code for ext2, study it and then add your cipher calls to it. You should resist the urge to copy a whole other filesystem as a starting point. Although it's only 5,000+ lines of code, kernel code is at least an order of magnitude more complex to develop than user-level code. If you actually end up putting the calls to your cipher in the right place in this new filesystem, you'll find you spent most of your time studying it, only to add a small number of lines in some places. Even so, now you've got yourself a single encrypting ext2 filesystem. What if you want an encrypting NFS filesystem or any one of the plethora of other Linux filesystems?
Linux, like most OSes, separates its filesystem code into two components: native filesystems (ext2, NFS, etc.) and a general-purpose layer called the virtual filesystem (VFS). The VFS is a layer that sits between system call entry points and native filesystems. The VFS provides a uniform access mechanism to filesystems without needing to know the details of those filesystems. When filesystems are initialized in the kernel, they install a set of function pointers (methods in OO-speak) for the VFS to use. The VFS, in turn, calls these function pointers generically, without knowing which specific filesystem the pointers represent. For example, an unlink system call gets translated into a service routine sys_unlink, which invokes the vfs_unlink VFS function, which invokes a filesystem-specific method by using its installed function pointer: ext2_unlink for ext2, nfs_unlink for NFS or the appropriate function for other filesystems. Throughout this article, we refer to the specific filesystem method using ->, as in ->unlink().
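The dispatch through installed function pointers can be illustrated with a minimal user-space sketch. This is not kernel code: the names fs_ops, toy_vfs_unlink, toy_ext2_unlink and toy_nfs_unlink are hypothetical stand-ins invented for this illustration, and the real VFS structures carry far more state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-in for a filesystem's method table. */
struct fs_ops {
    const char *name;
    int (*unlink)(const char *path);    /* plays the role of ->unlink() */
};

/* Records which toy filesystem handled the most recent call. */
const char *last_called = "";

/* Two toy "native" filesystems; each one only notes that it was called. */
int toy_ext2_unlink(const char *path) { (void)path; last_called = "ext2"; return 0; }
int toy_nfs_unlink(const char *path)  { (void)path; last_called = "nfs";  return 0; }

/* Each filesystem "installs" its methods by filling in a table. */
const struct fs_ops toy_ext2_ops = { "ext2", toy_ext2_unlink };
const struct fs_ops toy_nfs_ops  = { "nfs",  toy_nfs_unlink  };

/* Toy "VFS" entry point: it calls through the pointer without knowing
 * which filesystem is behind it. */
int toy_vfs_unlink(const struct fs_ops *ops, const char *path)
{
    assert(ops != NULL && path != NULL);
    if (ops->unlink == NULL)
        return -1;                      /* method not provided */
    return ops->unlink(path);
}
```

Calling toy_vfs_unlink(&toy_ext2_ops, "/tmp/a") reaches toy_ext2_unlink, while the same call with &toy_nfs_ops reaches toy_nfs_unlink; the caller's code is identical in both cases.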
To solve this problem of how to develop our encryption filesystem quickly, we employ the following adage: “Any problem in computer science can be solved by adding another level of indirection.” Luckily, the Linux VFS allows another filesystem to be inserted right between the VFS and another filesystem. Figure 1 shows such a stackable encryption filesystem called Cryptfs. Cryptfs is called stackable because it stacks on top of another filesystem (ext2). Here, the VFS calls Cryptfs' ->write() method (cryptfs_write); Cryptfs encrypts the user data it receives and passes it down by calling the ->write() method below (ext2_write).

Figure 1. An Example Stackable Encryption Filesystem
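The layering in Figure 1 can be approximated in user space as follows. This is only a sketch under loud assumptions: the "disk" is a byte array, the lower layer is a pair of hypothetical functions (lower_write, lower_read) standing in for ext2, and the "cipher" is a single-byte XOR that provides no real security. It is meant only to show where the encrypt and decrypt calls sit relative to the call into the lower layer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_DISK_SIZE 64
#define TOY_KEY 0x5A   /* single fixed key; XOR is a placeholder, NOT a real cipher */

/* Lower layer: a byte array standing in for the native filesystem's storage. */
unsigned char toy_disk[TOY_DISK_SIZE];

long lower_write(const unsigned char *buf, size_t len)
{
    if (len > TOY_DISK_SIZE)
        return -1;
    memcpy(toy_disk, buf, len);
    return (long)len;
}

long lower_read(unsigned char *buf, size_t len)
{
    if (len > TOY_DISK_SIZE)
        return -1;
    memcpy(buf, toy_disk, len);
    return (long)len;
}

/* Placeholder transform; applying it twice restores the input. */
static void toy_xor(unsigned char *dst, const unsigned char *src, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        dst[i] = src[i] ^ TOY_KEY;
}

/* Upper layer write: encrypt first, then pass the result down. */
long toy_cryptfs_write(const unsigned char *buf, size_t len)
{
    unsigned char tmp[TOY_DISK_SIZE];

    assert(buf != NULL);
    if (len > sizeof(tmp))
        return -1;
    toy_xor(tmp, buf, len);
    return lower_write(tmp, len);
}

/* Upper layer read: fetch from below, then decrypt before returning. */
long toy_cryptfs_read(unsigned char *buf, size_t len)
{
    long n;

    assert(buf != NULL);
    n = lower_read(buf, len);
    if (n > 0)
        toy_xor(buf, buf, (size_t)n);
    return n;
}
```

After toy_cryptfs_write() stores a buffer, toy_disk holds only the transformed bytes, and toy_cryptfs_read() returns the original data. The lower layer never sees clear text, which is the property the stacking arrangement is meant to provide.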
In general, stackable filesystems can stand alone and be mounted on top of any other existing filesystem mountpoint; this means you only have to develop your (stackable) filesystem once, and it will work with any other native (low-level) filesystem such as ext2, NFS, etc. Moreover, as of Linux 2.4.20, stackable filesystems even can be exported safely (via nfs-utils-1.0 or newer) to remote NFS clients.
The basic function of a stackable filesystem is to pass an operation and its arguments to the lower-level filesystem. The following distilled code snippet shows how a stackable null-layer pass-through filesystem called Wrapfs handles the ->unlink() operation:
    int wrapfs_unlink(struct inode *dir,
                      struct dentry *dentry)
    {
        int err = 0;
        struct inode *lower_dir;
        struct dentry *lower_dentry;

        /* find the matching objects in the filesystem mounted below */
        lower_dir = get_lower_inode(dir);
        lower_dentry = get_lower_dentry(dentry);

        /* pre-call code can go here */

        /* call the lower filesystem's own ->unlink() method */
        err = lower_dir->i_op->unlink(lower_dir,
                                      lower_dentry);

        /* post-call code can go here */

        return err;
    }
When the VFS needs to unlink a file in a Wrapfs filesystem, it calls wrapfs_unlink, passing it the inode of the directory in which the file to remove resides (dir) and the name of the entry to remove (encapsulated in dentry).
Every filesystem keeps a set of objects that belong to it, including inodes, directory entries and open files. When using stacking, multiple objects represent the same file—only at different layers. For example, our Cryptfs in Figure 1 may have to keep a directory entry (dentry) object with the clear-text version of the filename, while ext2 will keep another dentry with the ciphertext (encrypted) version of the same name. To be truly transparent to the VFS and other filesystems, stackable filesystems keep multiple objects at each level.
This is why the first few actions that wrapfs_unlink takes are to locate, from the arguments it gets, the inode and dentry that correspond to the same objects, only at the filesystem mounted below. These get_lower_* functions essentially follow pointers that previously were stored in the private fields of Wrapfs' objects when those objects were created. Once the lower objects are located, the main magic of stacking takes place. We call the lower-level filesystem's own ->unlink() method, through the lower-level directory inode, and pass it the two lower objects.
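The pointer-following described above can be sketched in user space. Everything here is a hypothetical simplification: the toy_inode and toy_dentry structures, the private_data field and the toy_get_lower_* helpers are invented for illustration and are not the kernel's or Wrapfs' actual definitions. The sample lower name is likewise made up.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode;

/* Simplified directory entry: a name plus a private pointer to the
 * matching object one layer down (NULL at the lowest layer). */
struct toy_dentry {
    const char *name;
    void *private_data;
};

struct toy_inode_ops {
    int (*unlink)(struct toy_inode *dir, struct toy_dentry *dentry);
};

/* Simplified inode: method table, private pointer, and a counter the
 * toy lower filesystem uses to show that it was called. */
struct toy_inode {
    const struct toy_inode_ops *i_op;
    void *private_data;
    int entries_removed;
};

/* Follow the pointers stored when the upper objects were created. */
struct toy_inode *toy_get_lower_inode(struct toy_inode *upper)
{
    return upper->private_data;
}

struct toy_dentry *toy_get_lower_dentry(struct toy_dentry *upper)
{
    return upper->private_data;
}

/* Toy lower filesystem: just counts removals in the directory. */
int toy_lower_unlink(struct toy_inode *dir, struct toy_dentry *dentry)
{
    (void)dentry;
    dir->entries_removed++;
    return 0;
}

/* Pass-through unlink: locate the lower objects, then call the lower
 * filesystem's own method through the lower directory inode. */
int toy_wrapfs_unlink(struct toy_inode *dir, struct toy_dentry *dentry)
{
    struct toy_inode *lower_dir = toy_get_lower_inode(dir);
    struct toy_dentry *lower_dentry = toy_get_lower_dentry(dentry);

    assert(lower_dir != NULL && lower_dentry != NULL);
    return lower_dir->i_op->unlink(lower_dir, lower_dentry);
}

/* Example wiring: upper objects point at their lower counterparts. */
const struct toy_inode_ops toy_lower_ops = { toy_lower_unlink };
struct toy_inode  toy_lower_dir    = { &toy_lower_ops, NULL, 0 };
struct toy_dentry toy_lower_dentry = { "x9f3a", NULL };
struct toy_inode  toy_upper_dir    = { NULL, &toy_lower_dir, 0 };
struct toy_dentry toy_upper_dentry = { "secret.txt", &toy_lower_dentry };
```

Calling toy_wrapfs_unlink(&toy_upper_dir, &toy_upper_dentry) increments the counter on toy_lower_dir only, showing that the operation was carried out entirely by the lower layer on the lower objects.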
Wrapfs is a full-fledged stackable null-layer (or loopback) filesystem that simply passes all operations and objects (unmodified) between the VFS and the lower filesystem. Wrapfs itself, however, is not easy to write for one main reason: it has to treat the lower filesystem as if it were the VFS, while appearing to the real Linux VFS as a lower-level filesystem. This dual role requires careful handling of locks, reference counts, allocated memory and so on. Luckily, Wrapfs already has been written and is maintained, so it serves as an excellent template for you to modify and add new functionality.
Comments
Nicely demonstrated
Nicely demonstrated stackable file systems.
However, in real applications it is hard to keep the two layers (crypt fs and underlying low level file system) separate.
Re: Kernel Korner: Writing Stackable Filesystems
Really nice article, although I would be thrilled to see more follow-ups on the same topic!
amazing article...
I must congratulate you for an article that is simple and addresses the core of the issues relating to stacking.
Keep posting new articles.
Pradeep