PVFS: A Parallel Virtual File System for Linux Clusters
Management dæmons, or managers, have two responsibilities: validating permission to access files and maintaining metadata on PVFS files. Both tasks revolve around access to metadata files. Only one management dæmon is needed to perform these operations for a file system, and a single management dæmon can manage multiple file systems. The manager is also responsible for maintaining the file system directory hierarchy. Applications running on compute nodes communicate with the manager when performing activities such as listing directory contents, opening files and removing files.
I/O dæmons, on the other hand, serve the single purpose of accessing PVFS file data and coordinating data transfer between themselves and applications. Applications establish direct connections to the I/O servers and exchange data with them directly during read and write operations.
There are several options for providing PVFS access to the client nodes. First, there is a library, available in shared or static form, that can be used to interact with the file system through its native interface. This requires writing applications specifically to use functions such as pvfs_open, however. As an alternative, there are two access methods that provide transparent access. The preferred method is to use the PVFS kernel module, which allows full access through the Linux VFS mechanism. This loadable module allows the user to mount PVFS just like any other traditional file system. Another option is to use a set of C library wrappers that are provided with PVFS. These wrappers trap calls to functions such as open and close before they reach the kernel level. This provides higher performance, but compatibility is incomplete, and the wrappers work only with certain supported versions of glibc.
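Transparent access means that a program needs no PVFS-specific calls once the file system is mounted. The following minimal sketch illustrates the idea with ordinary file I/O. The mount point /mnt/pvfs is a hypothetical path chosen for illustration, not one taken from our setup, and the script falls back to a temporary directory when that path is absent so that it runs on any machine.

```python
import os
import tempfile

# Hypothetical mount point; this path is an assumption for illustration only.
PVFS_MOUNT = "/mnt/pvfs"


def roundtrip(directory, name="hello.txt", payload=b"hello from a compute node\n"):
    """Write payload to directory/name with ordinary file calls, then read it back."""
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        data = f.read()
    os.remove(path)  # leave the directory as we found it
    return data


def pick_directory():
    """Use the assumed mount point if it exists and is writable, else a temp dir."""
    if os.path.isdir(PVFS_MOUNT) and os.access(PVFS_MOUNT, os.W_OK):
        return PVFS_MOUNT
    return tempfile.mkdtemp()


if __name__ == "__main__":
    print(roundtrip(pick_directory()))
```

Nothing in the code refers to PVFS beyond the directory name, which is the point of the kernel module approach: the same program works on a local disk, an NFS share or a mounted PVFS tree.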
A final option is to use the MPI-IO interface, which is part of the MPI-2 standard for message passing in parallel applications. The MPI-IO interface for PVFS is provided through the ROMIO MPI-IO implementation (see Resources) and allows MPI applications to take advantage of the features of MPI-IO when accessing PVFS. It also ensures that the MPI code will be compatible with other ROMIO-supported parallel file systems.
The test system at Ericsson Montréal started as a cluster of seven diskless Pentium-grade CPUs with 256MB of RAM each. These CPUs first boot using a minimal kernel written to flash with a tool provided by the manufacturer. They then get their IP address and download a RAM disk from a Linux box acting as both a DHCP and a TFTP server. This same machine also acts as an NFS server for the CPUs, providing shared disk space.
When we decided to experiment with PVFS, we needed some PCs with disks to act as I/O nodes and one PC to be the management node. We added one machine, PC1, to be the management node and three machines, PC2, PC3 and PC4, with a total disk space of 35GB, to be the I/O nodes. The new map of the cluster became:
Seven Diskless Client CPUs
One Management Node
Three I/O Nodes
While PVFS developers provide RPMs for all types of nodes, we chose to recompile the source in order to optimize installation on the diskless clients. This went over without a hitch using the PVFS tarball package. For the manager and I/O nodes, we used the relevant RPM packages. The manager and I/O nodes are using the Red Hat 6.2 distribution and the 2.2.14-5.0 kernel. The diskless CPUs run a customized minimal version of the 2.2.14-5.0 kernel.
The first step towards setting up the PVFS manager is to download the PVFS manager RPM package and install it. PVFS will be installed by default under /usr/pvfs. Once the automatic installation is done, it is necessary to create the configuration files. PVFS requires two configuration files in order to operate: “.pvfsdir”, which describes the directory to PVFS, and “.iodtab”, which describes the location of the I/O dæmons. These files are created by running the mkiodtab script (as root):
[root@pc1 /root]# /usr/pvfs/bin/mkiodtab
See Listing 1 for the iodtab setup for the Parallel Virtual File System. The mkiodtab script also creates the .pvfsdir file in the root directory.
When we ran mkiodtab on the manager, PC1, it complained that it could not find the I/O nodes. It turned out that we had forgotten to include entries for our I/O nodes in /etc/hosts. We updated the /etc/hosts file and reran mkiodtab, and this time everything went okay. mkiodtab created a file called “.iodtab” under /pvfs. This file contained the list of our I/O nodes. It looked like the following:
------------/pvfs/.iodtab------------
pc2:7000
pc3:7000
pc4:7000
-------------------------------------
By default, the I/O dæmon listens on port 7000 for client connections over the network.
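The /etc/hosts problem we ran into can be caught before running mkiodtab by checking that every I/O node name resolves. The sketch below is one possible pre-flight check, not part of PVFS: the function names parse_iodtab and unresolved_hosts are our own, the parser assumes one host:port entry per line as in the file shown above, and treating an entry without a port as port 7000 is an assumption based on the stated default.

```python
import socket

DEFAULT_IOD_PORT = 7000  # default I/O daemon port mentioned in the text


def parse_iodtab(text):
    """Parse 'host:port' lines into (host, port) tuples, skipping blank lines.

    Assumption: an entry without ':port' uses DEFAULT_IOD_PORT.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        host, sep, port = line.rpartition(":")
        if not sep:  # no colon found: the whole line is the host name
            host, port = line, str(DEFAULT_IOD_PORT)
        entries.append((host, int(port)))
    return entries


def unresolved_hosts(entries, resolve=socket.gethostbyname):
    """Return the host names that fail name resolution."""
    missing = []
    for host, _port in entries:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(host)
    return missing


if __name__ == "__main__":
    sample = "pc2:7000\npc3:7000\npc4:7000\n"
    nodes = parse_iodtab(sample)
    for host in unresolved_hosts(nodes):
        print("no address found for", host, "- check /etc/hosts")
```

The resolve parameter exists only so the check can be exercised without touching the real resolver; on the manager node the default, socket.gethostbyname, consults /etc/hosts through the normal system lookup.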
After running mkiodtab, we did the following to start PC1 as the PVFS manager:
Start the manager dæmon:
% /usr/pvfs/bin/mgr
% /usr/pvfs/bin/enablemgr
Running enablemgr on the management node ensures that the manager dæmon is started automatically the next time the machine boots, so it does not need to be started manually after a reboot. The enablemgr command needs to be run only once to set up the appropriate links.