PVFS: A Parallel Virtual File System for Linux Clusters
Management dæmons, or managers, have two responsibilities: validating permission to access files and maintaining metadata on PVFS files. Both of these tasks revolve around access to metadata files. Only one management dæmon is needed to perform these operations for a file system, and a single management dæmon can manage multiple file systems. The manager is also responsible for maintaining the file system directory hierarchy. Applications running on compute nodes communicate with the manager when performing activities such as listing directory contents, opening files and removing files.
I/O dæmons, on the other hand, serve the single purpose of accessing PVFS file data and coordinating data transfer between themselves and applications. Direct connections are established between applications and I/O servers so that data can be exchanged directly during read and write operations.
There are several options for providing PVFS access to the client nodes. First, there is a shared, or static, library that can be used to interact with the file system through its native interface. This requires writing applications specifically to use functions such as pvfs_open, however. As an alternative, two access methods provide transparent access. The preferred method is the PVFS kernel module, which allows full access through the Linux VFS mechanism. This loadable module lets the user mount PVFS just like any other traditional file system. Another option is a set of C library wrappers provided with PVFS. These wrappers trap calls to functions such as open and close before they reach the kernel level. This approach provides higher performance, but it has drawbacks: compatibility is incomplete, and the wrappers work only with certain supported versions of glibc.
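As a rough sketch, the commands below show what the two transparent access routes might look like on a PVFS 1.x installation. The module path, the mount helper, the mount point and the wrapper library name are illustrative assumptions, not details taken from this article; check your own installation for the real names.

```shell
# Sketch only: the names below (pvfs.o, mount.pvfs, libpvfs.so,
# /mnt/pvfs) are illustrative assumptions; adjust them to match
# your own PVFS installation.

# Kernel-module route: load the module, then mount PVFS through
# the Linux VFS like any other file system.
/sbin/insmod /usr/pvfs/bin/pvfs.o
/usr/pvfs/bin/mount.pvfs pc1:/pvfs /mnt/pvfs

# Ordinary tools then work transparently on the mounted tree:
ls /mnt/pvfs

# Library-wrapper route: preload the C wrappers so that calls such
# as open() and close() are trapped before they reach the kernel.
LD_PRELOAD=/usr/pvfs/lib/libpvfs.so cat /pvfs/somefile
```

Either route lets unmodified applications reach PVFS files; the kernel module trades some performance for full VFS compatibility, while the preloaded wrappers trade compatibility for speed.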
A final option is to use the MPI-IO interface, which is part of the MPI-2 standard for message passing in parallel applications. The MPI-IO interface for PVFS is provided through the ROMIO MPI-IO implementation (see Resources) and allows MPI applications to take advantage of the features of MPI-IO when accessing PVFS. It also ensures that the MPI code will be compatible with other ROMIO-supported parallel file systems.
The test system at Ericsson Montréal started as a cluster of seven diskless Pentium-grade CPUs with 256MB of RAM each. These CPUs first boot using a minimal kernel written to flash with a tool provided by the manufacturer. They then get their IP addresses and download a RAM disk from a Linux box acting as both a DHCP and a TFTP server. This same machine also acts as an NFS server for the CPUs, providing shared disk space.
When we decided to experiment with PVFS, we needed some PCs with disks to act as I/O nodes and one PC to be the management node. We added one machine, PC1, to be the management node and three machines, PC2, PC3 and PC4, with a total disk space of 35GB, to be the I/O nodes. The new map of the cluster became:
- Seven diskless client CPUs
- One management node
- Three I/O nodes
While PVFS developers provide RPMs for all types of nodes, we chose to recompile the source in order to optimize installation on the diskless clients. This went over without a hitch using the PVFS tarball package. For the manager and I/O nodes, we used the relevant RPM packages. The manager and I/O nodes are using the Red Hat 6.2 distribution and the 2.2.14-5.0 kernel. The diskless CPUs run a customized minimal version of the 2.2.14-5.0 kernel.
The first step towards setting up the PVFS manager is to download the PVFS manager RPM package and install it. PVFS will be installed by default under /usr/pvfs. Once the automatic installation is done, it is necessary to create the configuration files. PVFS requires two configuration files in order to operate: “.pvfsdir”, which describes the directory to PVFS, and “.iodtab”, which lists the locations of the I/O dæmons. These files are created by running the mkiodtab script (as root):
[root@pc1 /root]# /usr/pvfs/bin/mkiodtab
See Listing 1 for the iodtab setup for the Parallel Virtual File System. The mkiodtab script also creates the .pvfsdir file in the root directory.
When we ran mkiodtab on the manager, PC1, it complained that it could not find the I/O nodes. It turned out that we had forgotten to include entries for our I/O nodes in /etc/hosts. We updated the /etc/hosts file and reran mkiodtab; this time everything went fine. mkiodtab created a file called “.iodtab” under /pvfs, containing the list of our I/O nodes. It looked like the following:
------------/pvfs/.iodtab------------
pc2:7000
pc3:7000
pc4:7000
-------------------------------------
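Because the .iodtab format is simply one host:port pair per line, it is easy to generate or sanity-check by hand. The sketch below writes an equivalent file into a scratch directory (rather than the real /pvfs) purely for illustration; in practice mkiodtab does this for you.

```shell
# Sketch: build an .iodtab equivalent by hand in a scratch directory.
# One "host:port" line per I/O node, matching the file shown above.
scratch=$(mktemp -d)
iodtab="$scratch/.iodtab"
for node in pc2 pc3 pc4; do
    echo "${node}:7000"
done > "$iodtab"

# Sanity check: count lines that do NOT match host:port (0 = all valid).
grep -vcE '^[A-Za-z0-9._-]+:[0-9]+$' "$iodtab" || true

cat "$iodtab"
```

The same loop scales to any number of I/O nodes; only the host list changes.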
By default, the I/O dæmons listen on port 7000 for client connections over the network.
After running mkiodtab, we did the following to start PC1 as the PVFS manager:
Start the manager dæmon:

% /usr/pvfs/bin/mgr
% /usr/pvfs/bin/enablemgr
Running enablemgr on the management node ensures that the dæmons will be started automatically the next time the machine boots, so they don't need to be started manually after a reboot. The enablemgr command needs to be run only once to set up the appropriate links.
Practical Task Scheduling Deployment
July 20, 2016 12:00 pm CDT
One of the best things about the UNIX environment (aside from being stable and efficient) is the vast array of software tools available to help you do your job. Traditionally, a UNIX tool does only one thing, but does that one thing very well. For example, grep is very easy to use and can search vast amounts of data quickly. The find tool can find a particular file or files based on all kinds of criteria. It's pretty easy to string these tools together to build even more powerful tools, such as a tool that finds all of the .log files in the /home directory and searches each one for a particular entry. This erector-set mentality allows UNIX system administrators to seem to always have the right tool for the job.
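The find-plus-grep combination described above can be sketched as a single pipeline. The directory tree and the "ERROR" pattern below are illustrative stand-ins for /home and a real log entry; the sketch builds a scratch tree so it is self-contained.

```shell
# Sketch: search every .log file under a tree for one entry.
# A scratch tree stands in for /home; "ERROR" for the entry sought.
logdir=$(mktemp -d)
mkdir -p "$logdir/app"
printf 'ok\nERROR: disk full\n' > "$logdir/app/web.log"
printf 'all quiet\n'            > "$logdir/app/mail.log"

# find feeds every matching file to grep; -H prefixes each hit
# with the name of the file it came from.
find "$logdir" -name '*.log' -exec grep -H 'ERROR' {} +
```

Swapping in /home and a real pattern gives exactly the "find all the .log files and search each one" tool the text describes, built from two small utilities.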
Cron traditionally has been considered another such tool, handling job scheduling, but is it enough? This webinar considers that very question. The first part builds on a previous Geek Guide, Beyond Cron, and briefly describes how to know when it might be time to upgrade your job-scheduling infrastructure. The second part presents an actual planning and implementation framework.
Join Linux Journal's Mike Diehl and Pat Cameron of Help Systems.
Free to Linux Journal readers. Register Now!
With all the industry talk about the benefits of Linux on Power and all the performance advantages offered by its open architecture, you may be considering a move in that direction. If you are thinking about analytics, big data and cloud computing, you would be right to evaluate Power. The idea of using commodity x86 hardware and replacing it every three years is an outdated cost model. It doesn't consider the total cost of ownership, and it doesn't consider the advantages of real processing power, high availability and multithreading like a demon.
This ebook takes a look at some of the practical applications of the Linux on Power platform and ways you might bring all the performance power of this open architecture to bear for your organization. There are no smoke and mirrors here—just hard, cold, empirical evidence provided by independent sources. I also consider some innovative ways Linux on Power will be used in the future. Get the Guide