Pgfs: The PostGres File System
Here's a description of the real Pgfs program that you can download. Pgfs is a normal user-level program that reads and writes ordinary TCP streams and UDP packets. Since it is a normal program that requires no privileges, it can run on any Linux system. It doesn't use any groundbreaking system call features, so no kernel modifications are necessary. The TCP stream packets are generated by the PostGres client library, so Pgfs can interact with a PostGres database using SQL. The UDP packets are formatted according to the conventions of the NFS protocol. This means that an NFS client, such as a Linux kernel, can choose to send NFS packets to Pgfs and can mount a file system as if Pgfs were any other variety of NFS server. The AMD automounter is another example of a user-level program that acts as an NFS server. AMD responds only to the directory-browsing NFS operations that trigger an automounter response, whereas Pgfs responds to all NFS operations.
In essence, Pgfs is an NFS <-> SQL translator. When an NFS request comes in, the C code submits SQL to get the stat(2) structures for the directory and file mentioned in the request, doing error and permission checking as it goes along. It first compares the request with the data it gets back about the file and enforces conditions such as whether rmdir can or can't be used on that file (rmdir must not delete a regular file, for example).
If the request is valid and the permissions allow it, the C code finds all the stat(2) structures that must be changed, such as the current file, the current directory, the directory above and hard links that share the file's inode. Then these modifications are made in the database by SQL. The modifications include side effects like updating the access time that you might not ordinarily think of.
Each NFS operation is processed within a database transaction. If an “expected” error occurs that could be caused by bad user input on the NFS client, such as typing rmdir to delete a file, an NFS error is returned. If an “unexpected” error occurs, such as the database not responding or a file handle not found, the transaction is aborted in a way that will not pollute the file system with bad data.
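The split between "expected" and "unexpected" errors can be sketched as follows. This is a minimal illustration, not Pgfs code: it is written in Python rather than C, uses sqlite3 as a stand-in for PostGres, and the table columns (is_dir, atime), the ExpectedError class and the run_operation wrapper are all hypothetical names introduced for the example. Only the status value 20 (NFSERR_NOTDIR) comes from the NFS version 2 protocol.

```python
import sqlite3

NFS_OK = 0
NFSERR_NOTDIR = 20  # NFSv2 status code for "not a directory"


class ExpectedError(Exception):
    """Hypothetical: an error caused by bad client input, reported as an NFS status."""

    def __init__(self, status):
        super().__init__(status)
        self.status = status


def make_demo_db():
    # Hypothetical two-row tree: inode 1 is a directory, inode 2 is a plain file.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tree (inode INTEGER PRIMARY KEY, is_dir INTEGER, atime INTEGER)")
    conn.executemany("INSERT INTO tree VALUES (?, ?, 0)", [(1, 1), (2, 0)])
    conn.commit()
    return conn


def rmdir(conn, inode):
    row = conn.execute("SELECT is_dir FROM tree WHERE inode = ?", (inode,)).fetchone()
    if row is None:
        # "Unexpected": the handle should always resolve to a row.
        raise LookupError("file handle not found")
    if not row[0]:
        # "Expected": the user typed rmdir on something that is not a directory.
        raise ExpectedError(NFSERR_NOTDIR)
    conn.execute("DELETE FROM tree WHERE inode = ?", (inode,))


def run_operation(conn, op, *args):
    """Run one operation in one transaction and map the outcome to an NFS status."""
    try:
        with conn:  # commits on success, rolls back if op raises
            op(conn, *args)
        return NFS_OK
    except ExpectedError as err:
        return err.status
    # Any other exception propagates after the rollback, leaving the tree unchanged.
```

Calling run_operation(conn, rmdir, 2) on the demo database returns NFSERR_NOTDIR and leaves both rows in place, while an operation that fails partway through with any other exception has its partial updates rolled back.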
Pgfs does all the things “by hand” that go on in a “real” file system. It uses PostGres as a storage device that it accesses by inode number, pathname and verset number. For example, the nfs_getattr NFS operation works like the lstat(2) system call. getattr takes a file identifier, in this case an NFS handle instead of a pathname, and returns all the fields of a stat(2) structure. When Pgfs processes an nfs_getattr operation, the following things happen:
The NFS packet is broken apart into operation and arguments.
NFS operations counters are incremented.
The NFS handle is broken into fields.
Bounds-checking is done on the nfs_getattr parameters.
stat(2) information is fetched for the handle, e.g., select * from tree where handle = 20934
Permissions are checked.
File access times are updated, e.g., update tree set atime = 843357663 where inode = 8923
The NFS reply is constructed.
The reply is sent to the NFS client.
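The middle of that sequence (counting the operation, splitting the handle, bounds-checking, fetching the row and updating the access time) might look roughly like the sketch below. It makes several assumptions: Python and sqlite3 stand in for the real C code and PostGres, the handle layout (inode and verset packed as two 32-bit integers, padded to the 32-byte NFS version 2 handle size) is invented, and the column names are hypothetical because Table 1 is not reproduced here. Packet parsing, the permission check and the network reply are left out.

```python
import sqlite3
import struct
import time
from collections import Counter

FHSIZE = 32                      # NFSv2 file handles are 32 opaque bytes
NFS_OK, NFSERR_STALE = 0, 70     # NFSv2 status codes
op_counters = Counter()          # per-operation statistics


def pack_handle(inode, verset):
    # Hypothetical layout: two big-endian 32-bit fields, zero-padded to FHSIZE.
    return struct.pack("!II", inode, verset).ljust(FHSIZE, b"\0")


def unpack_handle(handle):
    if len(handle) != FHSIZE:    # bounds check on the incoming parameter
        raise ValueError("bad handle length")
    return struct.unpack_from("!II", handle)


def demo_tree():
    # Hypothetical subset of the stat(2) columns.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tree (inode INTEGER PRIMARY KEY, verset INTEGER, "
        "mode INTEGER, uid INTEGER, gid INTEGER, size INTEGER, atime INTEGER)"
    )
    conn.execute("INSERT INTO tree VALUES (8923, 1, 33188, 0, 0, 4096, 0)")
    conn.commit()
    return conn


def nfs_getattr(conn, handle, now=None):
    op_counters["getattr"] += 1
    inode, verset = unpack_handle(handle)
    now = int(time.time()) if now is None else now
    columns = ("inode", "mode", "uid", "gid", "size", "atime")
    with conn:                   # one transaction per NFS operation
        row = conn.execute(
            "SELECT inode, mode, uid, gid, size, atime FROM tree "
            "WHERE inode = ? AND verset = ?",
            (inode, verset),
        ).fetchone()
        if row is None:
            return NFSERR_STALE, None
        # Side effect described in the text: refresh the access time.
        conn.execute("UPDATE tree SET atime = ? WHERE inode = ?", (now, inode))
    attrs = dict(zip(columns, row))
    attrs["atime"] = now         # reply reflects the updated access time
    return NFS_OK, attrs
```

A call such as nfs_getattr(conn, pack_handle(8923, 1)) returns a status and a dictionary of attributes, which a real server would then marshal into the NFS reply.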
The single table that holds all the stat(2) structures has fields defined as shown in Table 1.
Inode numbers are unique across the entire database, even for identical files in different versets. Each file in each verset has one database row. Each directory has three rows: one for its name from the directory above, one for . (dot), and one for .. (dot dot) from the directory below.
Philosophically, compression of similar file trees is the business of the back end of a program; it should not be visible to the user. In Pgfs, each collection of file bytes is contained in a Unix file, shared copy-on-write across all the versets from which the filename was inherited. Whenever a shared file is modified, a private copy is made for that verset. This matches Pgfs' system administration orientation, where files will be large and binary and replaced in their entirety, and the old and new binaries won't be similar enough to make differences small. This differs from source code, where the same files get incrementally modified over and over and differences are small. With the keep-whole-files policy, doing a grep on files in multiple versets won't be slower than staying within a single verset, because there is no long delay while a compression algorithm unpacks intermediate versions into a temporary area.
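The copy-on-write sharing described above can be illustrated with a small sketch. It is an assumption-laden model, not the Pgfs implementation: the VersetStore class, its method names and the blobN file naming are all hypothetical, and the mapping from (verset, filename) to a backing Unix file is kept in a Python dictionary rather than in the database.

```python
import os
import shutil


class VersetStore:
    """Hypothetical model: whole files shared copy-on-write between versets."""

    def __init__(self, root):
        self.root = root
        self.backing = {}        # (verset, name) -> path of the backing Unix file
        self._next = 0

    def _new_path(self):
        self._next += 1
        return os.path.join(self.root, f"blob{self._next}")

    def create(self, verset, name, data):
        path = self._new_path()
        with open(path, "wb") as f:
            f.write(data)
        self.backing[(verset, name)] = path

    def inherit(self, parent, child):
        # The child verset starts out sharing every backing file of its parent.
        for (verset, name), path in list(self.backing.items()):
            if verset == parent:
                self.backing[(child, name)] = path

    def read(self, verset, name):
        with open(self.backing[(verset, name)], "rb") as f:
            return f.read()

    def write(self, verset, name, offset, data):
        path = self.backing[(verset, name)]
        shared = sum(1 for p in self.backing.values() if p == path) > 1
        if shared:
            # First modification of a shared file: make a private copy for this verset.
            private = self._new_path()
            shutil.copyfile(path, private)
            path = self.backing[(verset, name)] = private
        with open(path, "r+b") as f:
            f.seek(offset)
            f.write(data)
```

Because every verset always points at a complete file, reading a file from any verset is a plain open and read, with no reconstruction from stored differences.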