The “Virtual File System” in Linux
The main data item in any Unix-like system is the “file”, and a unique path name identifies each file within a running system. Every file appears like any other file in the way it is accessed and modified: the same system calls and the same user commands apply to every file. This applies independently of both the physical medium that holds information and the way information is laid out on the medium. Abstraction from the physical storage of information is accomplished by dispatching data transfer to different device drivers. Abstraction from the information layout is obtained in Linux through the VFS implementation.
Linux looks at its file system in the same way Unix does—adopting the concepts of super block, inode, directory and file. The tree of files accessible at any time is determined by how the different parts are assembled, each part being a partition of the hard drive or other physical storage device that is “mounted” to the system.
While the reader is assumed to be well acquainted with the concept of mounting a file system, I'll detail the concepts of super block, inode, directory and file.
The super block owes its name to its heritage, from when the first data block of a disk or partition was used to hold meta information about the partition itself. The super block is now detached from the concept of data block, but it still contains information about each mounted file system. The actual data structure in Linux is called struct super_block and holds various housekeeping information, like mount flags, mount time and device block size. The 2.0 kernel keeps a static array of such structures to handle up to 64 mounted file systems.
An inode is associated with each file. Such an “index node” holds all the information about a named file except its name and its actual data. The owner, group, permissions and access times for a file are stored in its inode, as well as the size of the data it holds, the number of links and other information. The idea of detaching file information from file name and data is what allows the implementation of hard-links—and the use of “dot” and “dot-dot” notations for directories without any need to treat them specially. An inode is described in the kernel by a struct inode.
The directory is a file that associates inodes to file names. The kernel has no special data structure to represent a directory, which is treated like a normal file in most situations. Functions specific to each file system type are used to read and modify the contents of a directory independently of the actual layout of its data.
The file itself is associated with an inode. Usually files are data areas, but they can also be directories, devices, fifos (first-in-first-out) or sockets. An “open file” is described in the Linux kernel by a struct file item; the structure holds a pointer to the inode representing the file. file structures are created by system calls like open, pipe and socket, and are shared by father and child across fork.
While the previous list describes the theoretical organization of information, an operating system must be able to deal with different ways to layout information on disk. While it is theoretically possible to look for an optimum layout of information on disks and use it for every disk partition, most computer users need to access all of their hard drives without reformatting, to mount NFS volumes across the network, and to sometimes even access those funny CD-ROMs and floppy disks whose file names can't exceed 8+3 characters.
The problem of handling different data formats in a transparent way has been addressed by making super blocks, inodes and files into “objects”; an object declares a set of operations that must be used to deal with it. The kernel won't be stuck into big switch statements to be able to access the different physical layouts of data, and new file system types can be added and removed at run time.
The entire VFS idea, therefore, is implemented around sets of operations to act on the objects. Each object includes a structure declaring its own operations, and most operations receive a pointer to the “self” object as the first argument, thus allowing modification of the object itself.
In practice, a super block structure encloses a field struct super_operations *s_op, an inode encloses struct inode_operations *i_op and a file encloses struct file_operations *f_op.
All the data handling and buffering performed by the Linux kernel is independent of the actual format of the stored data. Every communication with the storage medium passes through one of the operations structures. The file system type, then, is the software module which is in charge of mapping the operations to the actual storage mechanism—either a block device, a network connection (NFS) or virtually any other means of storing and retrieving data. These modules can either be linked to the kernel being booted or compiled as loadable modules.
The current implementation of Linux allows use of loadable modules for all file system types but root (the root file system must be mounted before loading a module from it). Actually, the initrd machinery allows loading of a module before mounting the root file system, but this technique is usually exploited only on installation floppies.
In this article I use the phrase “file system module” to refer either to a loadable module or a file system decoder linked to the kernel.
This is in summary how all file handling happens for any given file system type, and is depicted in Figure 1:
struct file_system_type is a structure that declares only its own name and a read_super function. At mount time, the function is passed information about the storage medium being mounted and is asked to fill a super block structure, as well as loading the inode of the root directory of the file system as sb->s_mounted (where sb is the super-block just filled). The additional field requires_dev is used by the file system type to state whether it will access a block device: for example, the NFS and proc types don't require a device, while ext2 and iso9660 do. After the super block is filled, struct file_system_type is not used any more; only the super block just filled will hold a pointer to it in order to be able to give back status information to the user (/proc/mounts is an example of such information). The structure is shown in Listing 1.
The super_operations structure is used by the kernel to read and write inodes, write super block information back to disk and collect statistics (to deal with the statfs and fstatfs system calls). When a file system is eventually unmounted, the put_super operation is called—in standard kernel wording “get” means “allocate and fill”, “read” means “fill” and “put” means “release”. The super_operations declared by each file system type are shown in Listing 2.
After a memory copy of the inode has been created, the kernel will act on it using its own operations. struct inode_operations is the second set of operations declared by file system modules, and is listed below; they deal mainly with the directory tree. Directory-handling operations are part of the inode operations because the implementation of a dir_operations structure would bring in extra conditionals in file system access. Instead, inode operations that only make sense for directories will do their own error checking. The first field of the inode operations defines the file operations for regular files. If the inode is a fifo, a socket or a device-specific file operation will be used. Inode operations appear in Listing 3; note the definition of rename was changed in release 2.0.1.
The file_operations, finally, specify how data in the actual file is handled: the operations implement the low-level details of read, write, lseek and the other data-handling system calls. Since the same file_operations structure is used to act on devices, it also includes some fields that only make sense for character or block devices. It's interesting to note that the structure shown here is the structure declared in the 2.0 kernels, while 2.1 changed the prototypes of read, write and lseek to allow a wider range of file offsets. The file operations (as of 2.0) are shown in Listing 4.
- Hacking a Safe with Bash
- Django Models and Migrations
- Secure Server Deployments in Hostile Territory, Part II
- Home Automation with Raspberry Pi
- The Controversy Behind Canonical's Intellectual Property Policy
- Huge Package Overhaul for Debian and Ubuntu
- Shashlik - a Tasty New Android Simulator
- Embed Linux in Monitoring and Control Systems
- KDE Reveals Plasma Mobile
- diff -u: What's New in Kernel Development