Pgfs: The PostGres File System
Here's a description of the real Pgfs program that you can download. Pgfs is a normal user-level program that reads and writes ordinary TCP streams and UDP packets. Since it is a normal program that requires no privileges, it can run on any Linux system. It doesn't use any groundbreaking system-call features, so no kernel modifications are necessary. The TCP stream packets are generated by the PostGres client library, so Pgfs can interact with a PostGres database using SQL. The UDP packets are formatted according to the conventions of the NFS protocol. All this means that an NFS client, such as a Linux kernel, can choose to send NFS packets Pgfs' way and can mount a file system as if Pgfs were any other variety of NFS server. The AMD automounter is another example of a user-level program that acts as an NFS server; AMD responds only to the directory-browsing NFS operations that trigger an automounter response, whereas Pgfs responds to all NFS operations.
In essence, Pgfs is an NFS <-> SQL translator. When an NFS request comes in, the C code submits SQL to get the stat(2) structures for the directory and file mentioned in the request, doing error and permission checking as it goes along. First it compares the request with the data it gets back about the file, enforcing conditions such as the rule that rmdir cannot be used to delete a plain file.
If the request is valid and the permissions allow it, the C code finds all the stat(2) structures that must be changed, such as the current file, the current directory, the directory above and hard links that share the file's inode. Then these modifications are made in the database by SQL. The modifications include side effects like updating the access time that you might not ordinarily think of.
Each NFS operation is processed within a database transaction. If an “expected” error occurs that could be caused by bad user input on the NFS client, such as using rmdir to delete a plain file, an NFS error is returned. If an “unexpected” error occurs, such as the database not responding or a file handle that cannot be found, the transaction is aborted in a way that will not pollute the file system with bad data.
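The two error classes can be sketched as a small decision function. This is an illustration of the policy described above, not Pgfs's actual code; the enum names are invented for the example.

```c
/* Hypothetical sketch of Pgfs's two error classes.  An "expected"
 * error (bad user input) maps to an NFS error code sent back to the
 * client; an "unexpected" error (database down, handle not found)
 * aborts the transaction so no partial update reaches the file system. */
enum outcome { REPLY_OK, REPLY_NFS_ERROR, ABORT_TRANSACTION };

enum outcome classify(int user_error, int internal_error)
{
    if (internal_error)           /* e.g. query failed, stale handle    */
        return ABORT_TRANSACTION; /* ROLLBACK: leave no bad data behind */
    if (user_error)               /* e.g. rmdir on a plain file         */
        return REPLY_NFS_ERROR;   /* report the error to the NFS client */
    return REPLY_OK;              /* commit and send a normal reply     */
}
```

Keeping the rollback path separate from the "report an NFS error" path is what guarantees the database never ends up with a half-applied operation.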
Pgfs does all the things “by hand” that go on in a “real” file system. It uses PostGres as a storage device that it accesses by inode number, pathname and verset number. For example, the nfs_getattr NFS operation works like the lstat(2) system call: getattr takes a file identifier, in this case an NFS handle instead of a pathname, and returns all the fields of a stat(2) structure. When Pgfs processes an nfs_getattr operation, the following things happen:
1. The NFS packet is broken apart into operation and arguments.
2. NFS operation counters are incremented.
3. The NFS handle is broken into fields.
4. Bounds checking is done on the nfs_getattr parameters.
5. stat(2) information is retrieved for the handle, e.g., select * from tree where handle = 20934
6. Permissions are checked.
7. File access times are updated, e.g., update tree set atime = 843357663 where inode = 8923
8. The NFS reply is constructed.
9. The reply is sent to the NFS client.
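The two SQL statements in the steps above can be sketched as a helper that builds the query text. This is an illustration only: the table and column names follow the inline examples in the text, but the function name and calling convention are made up.

```c
#include <stdio.h>

/* Hypothetical sketch of the two queries behind nfs_getattr, using the
 * table and column names shown in the step list (tree, handle, atime,
 * inode).  The function itself is illustrative, not Pgfs's API. */
static void getattr_queries(long handle, long inode, long now,
                            char *select_q, char *update_q, size_t n)
{
    /* Step 5: fetch the stat(2) row for the file the handle names. */
    snprintf(select_q, n, "select * from tree where handle = %ld", handle);
    /* Step 7: the side effect of reading attributes is an updated
     * access time on the file's row. */
    snprintf(update_q, n, "update tree set atime = %ld where inode = %ld",
             now, inode);
}
```

Note that even a read-only operation like getattr writes to the database, because POSIX semantics require the access-time update.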
The single table that holds all the stat(2) structures has fields defined as shown in Table 1.
Inode numbers are unique across the entire database, even for identical files in different versets. Each file in each verset has one database row. Each directory has three rows: one for its name from the directory above, one for . (dot), and one for .. (dot dot) from the directory below.
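To make the three-rows-per-directory rule concrete, here is a hypothetical illustration of the rows a single directory contributes. The field names and inode values are assumptions for the example, not Pgfs's actual schema.

```c
/* Hypothetical illustration of the three tree-table rows one directory
 * contributes.  Field names and values are assumed, not Pgfs's schema:
 * each row links a name in some parent directory to an inode.
 * Example: the directory /usr (inode 100) under / (inode 2). */
struct row { long inode; long parent_inode; const char *name; };

static const struct row usr_rows[] = {
    { 100,   2, "usr" },  /* its name, seen from the directory above */
    { 100, 100, "."   },  /* dot: the directory's entry for itself   */
    {   2, 100, ".."  },  /* dot dot: the link back up from below    */
};
```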
Philosophically, compression of similar file trees is the business of the back end of a program; it should not be visible to the user. In Pgfs, each collection of file bytes is contained in a Unix file, shared copy-on-write across all the versets from which the filename was inherited. Whenever a shared file is modified, a private copy is made for that verset. This matches Pgfs' system-administration orientation, where files will be large and binary and replaced whole, and the old and new binaries won't be similar enough to make differences small. This differs from source code, where the same files get incrementally modified over and over and differences are small. With the keep-whole-files policy, doing a grep on files in multiple versets is no slower than staying within a single verset: there is no long delay while a compression algorithm unpacks intermediate versions into a temporary area.
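The whole-file copy-on-write rule reduces to one question at write time: is this byte store still shared with other versets? A minimal sketch, with the struct and field names invented for illustration:

```c
/* Hypothetical sketch of Pgfs's whole-file copy-on-write rule.  A byte
 * store referenced by more than one verset must be copied privately
 * before the first write; a store already private to the writing
 * verset is modified in place.  Names are illustrative, not Pgfs's. */
struct byte_store { int sharers; };  /* versets referencing these bytes */

static int must_copy_before_write(const struct byte_store *s)
{
    return s->sharers > 1;  /* shared: make a private copy first */
}
```

Because the copy is of the whole file rather than a delta, reads never have to reconstruct a version; the cost is paid once, at the first write in a verset.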