Pgfs: The PostGres File System

The details of how Pgfs came to be written and how it can save you disk space.

The PostGres file system is a Linux NFS server that presents software versions as distinct file trees in an NFS file system. Each version is completely distinct from all other versions, and can be modified independently without regard to versions before or after it. Each version retains all of the properties it would have on a normal file system, such as file ownership, permissions, binary file contents, cross-directory hard links and non-files such as devices and symlinks. The effect is the same as if each version of the software had its own separate directory, except far less disk space is used.

Design Motivation for Pgfs

As an example, let's say that a year ago, you picked your favorite Linux distribution, and installed it on your new computer. The distribution had about 15,000 files, and took up about 200MB on disk. During the year you did a lot of hacking on your software, and it is now quite different from the original distribution. These modifications were done incrementally, and some of them replaced the original binaries with completely new binaries, for instance upgrading sendmail(8) or ftpd(8). Now you wish to compare your machine to the original distribution, examine the changes you've made, and apply them to a new distribution of Linux.

How do you record the changes you've made? You could save a complete copy of your distribution every time after you modified it—doing that would consume 200MB * 100 mods = 20 GB of disk. Even using a pair of 9 GB drives that's awfully expensive. However, you notice that most modifications change only a few files—perhaps a total of a half MB per modification. Storing only the files that changed would use 200MB + (100 mods * 0.5MB/mod) = 250MB of disk space—that's much better. What application would store only the differences for you? You could use CVS, but CVS isn't really suitable.

Now let's say you are a systems administrator, and you've faced this version control problem daily for years without finding a satisfactory solution. So since you're also a developer, you decide to build an application to store similar file trees that exploits the compression opportunities you've found. Fundamentally, this application would need to eat file trees and spit them back out again, and it must use less disk than keeping a whole copy of each tree. It should accept files one at a time or in bulk. You shouldn't have to extract and resubmit a whole file tree just to make one change.

How would you implement this application? Start by deciding what data structures are needed and what routines are needed to manipulate the structures. Let's start with files. Files consist of two parts—a stat(2) structure and a big chunk of binary data. Suppose you store the binary data in a file and name this file with a number. Then, name the stat structures with a number and store the fixed-length structures in an array on disk. The structures can be broken apart using field-splitting routines and assembled using record-making routines. Next, a structure is needed to represent the different versions of your software. Each version of your operating system consists of a tree of files. You name a tree of files that represent one specific software version a “version set” or “verset” and number them.

The next thing needed is some routines to search the stat array on disk for a specific structure, add structures and delete them. Since you will be doing random access to the structures, store them in a dbm (database management) file and use the dbm access routines instead of writing your own. Dbm also gives you routines to handle an index into the stat structure numbers, in order to make your access faster. You will need to write maintenance routines to copy dbm files and to copy fields from one dbm file to another with a different structure layout.

To add a new stat structure into your array may require modifying fields in structures other than the one you're adding; for instance, when you add one file to an existing file tree. Your programming task would be a lot simpler if you could collect a bunch of these modifications and do them all at once, or not do any of them if you discover a problem. The idea of doing all or none of a complex modification is called “transactions”.

To use your application you need commands for it to accept. Some commands might be “add this whole tree of files”, “add this single file” and “replace some bytes of a file with these bytes”. The NFS people have figured out the minimal set of file operations you need. (See Sidebar 1.) Now decide how stat(2) structures will be modified for each of the file operations and write pseudo code to modify the stat(2) array. While designing the semantics of the NFS commands, start thinking about sending your application NFS commands, making it an NFS server.

Next write your application. How about using an SQL database? A database decouples the application data structures from their representation on disk giving you the following advantages:

  1. Structures can be defined with arbitrary fields and stored in tables.

  2. A full set of routines to add, delete and modify structures is available, as well as indices to find structures quickly.

  3. A nice command language is available to translate between structure formats as the application evolves.

  4. Routines in the database are designed to operate on chunks of data that won't fit in memory all at once, so your application can grow without problems.

To add a field called “cokecans” to tally the number of cans of Coke it took to create each file, just add it. You can transfer your existing data to a new table with the cokecans field in a couple lines of SQL. Compare this to coding in C where a bunch of custom binary-format conversion programs would have to be written.

Then, find the skeleton of a user-level NFS server and port it to Linux (see Sidebar 2), and hook up the source of NFS commands to the command input of your application. Now you have an NFS server that presents file trees, but compresses away the similarities between trees. Since your application can be used like any file system, you don't have to build any specialized programs to manipulate versions—you can search files with grep and compare file trees with diff.

To control your application, create some fake magic filenames that it can treat as special, like procfs. The lines that are written to these files are the commands to your application. Now your application can be controlled with echo(1) commands from the shell rather than some obscure socket protocol.

The above description is not exactly how I went about writing Pgfs, but it does outline the design motivation. After I tried to store copies of a BSDI distribution under CVS and failed in practical terms, I set out to write an NFS server implemented on a database. My first version was coded in Perl5 using the PostGres client library, and I typed in NFS commands as space-separated text strings. I recoded in C to pick up the NFS RPCs. My first database design schema used one table for “names”--holding filenames and symlinks, and another table for “inodes”--holding the rest of the stat(2) structure and the pointer to the file contents. However, I didn't like the join operation (i.e., matching up rows from two tables with the same key), and I didn't want to implement join either in the database or the application code.