Building a Bioinformatics Supercomputing Cluster

 in
Bioinformatics tools running in the OSCAR cluster environment turned 17 recycled PCs into a system that improves performance for user queries.

Bioinformatics is an increasingly important scientific discipline that involves the analysis of DNA and protein sequences. The Basic Local Alignment Search Tool (BLAST) was developed by The National Center for Biotechnology Information (NCBI) to aid scientists in the analysis of these sequences. A public version of this tool is available on the Web or by download. Because the BLAST Web site is such a popular tool, its performance can be inconsistent at best. The University of South Dakota (USD) Computer Science Bioinformatics Group decided to implement a parallel version of the BLAST tool on a Linux cluster by combining freely available software. The BLAST cluster, comprised of old desktop PCs destined for surplus, improves searches by providing up-to-date databases to a smaller audience of researchers.

Our cluster project began with an implementation of the Open Source Cluster Application Resources (OSCAR). OSCAR was developed by the Open Cluster Group to improve cluster computing by providing all the necessary software to create a Linux cluster in one package. OSCAR helps automate the installation, maintenance and even the use of cluster software. A graphical user interface provides a step-by-step installation guide and doubles as a graphical maintenance tool.

WWW BLAST was created by NCBI to offer a Web-based front end for BLAST users and is the Web interface we selected for our BLAST cluster. WWW BLAST can be installed easily on a Linux machine running a Web server such as Apache.

Although WWW BLAST enhances the usability of our cluster, mpiBLAST enhances the performance. mpiBLAST was developed by Los Alamos National Laboratories (LANL) to improve the performance of BLAST by executing queries in parallel. mpiBLAST is based on the Message Passing Interface (MPI), a common software tool for developing parallel programs. mpiBLAST provides all of the software necessary for parallel BLAST queries.

Overview of a Query

A Web-based query form marks the beginning of a BLAST search on our cluster. By default, WWW BLAST does not support batch processing and job scheduling. Fortunately, OpenPBS and Maui are provided by the OSCAR software suite to handle job scheduling and load balancing. With this support, the cluster can handle more easily a larger user audience. OpenPBS is a flexible batch queuing system originally developed for NASA. Maui extends the capabilities of OpenPBS by allowing more extensive job control and scheduling policies.

Once the user submits the query, a Perl script provided by WWW BLAST is invoked. This script creates a unique job based on parameters from the query form. A job is a program or task submitted to OpenPBS for execution. Once the job has been submitted, OpenPBS determines node availability and executes the job based on scheduling policies. This job starts the local area multicomputer (LAM) software, which is a user-level, dæmon-based run-time environment. LAM is available as part of the OSCAR installation and provides many of the services required by MPI programs. OpenPBS executes the job by utilizing the mpirun command, which executes the query on each node and gathers the results. WWW BLAST passes these results back to the browser, presenting the user with a human-friendly report (Figure 1).

Figure 1. When a query comes in from the Web, WWW BLAST submits a job to OpenPBS. OpenPBS starts the job with mpirun and WWW BLAST formats the results.

Implementing cluster technology to perform parallel BLAST searches requires some software reconfiguration. Many of the tools we use work with default installations, but a parallel BLAST cluster requires extra configuration to get things running.

Using OSCAR to Build a Cluster

Clusters may be made up of a variety of PCs. The 17 nodes we used had 533MHz Intel Celeron processors, 256MB of RAM and 15GB of hard disk space—relatively low-end by today's standards. Using the exact same hardware setup for all of the nodes is not vital for cluster set up, but doing so does reduce the time and effort needed to install and maintain your cluster. Once all of the hardware is ready, you must choose a machine to be the head node. If you are not using identical machines, it would be beneficial to use the most powerful as the head node. Because all of the PCs we used have the exact same hardware configuration, the choice of the head node was arbitrary.

After you have obtained all the necessary PC hardware, you need to choose a Linux distribution. The OSCAR documentation lists all of its supported distributions, and Red Hat 9.0 was our distribution of choice. Installing Red Hat was pretty straightforward; we chose the default options. Because the OSCAR software depends on specific versions of OS packages, you should not install any updates once the installation completes. Of course, this has many security implications, which is why it is important to keep your cluster separated from the Internet by a firewall.

Once Red Hat was installed on the head PC, we downloaded the OSCAR 2.3.1 tarball. See the on-line Resources for installation documentation. We downloaded OSCAR into root's home directory, because OSCAR needs to be installed as root. Installing the OSCAR software was as easy as running the following commands:

tar -xvfz oscar-2.3.1.tar.gz
cd oscar-2.3.1
./configure
make install

After the installation completed, we needed to copy all of the Red Hat 9.0 RPMs to /tftpboot/rpm on our head PC. The OSCAR installation needs to install certain packages from this directory during its installation. We used the following command to copy the files:

cp /mnt/cdrom/RedHat/RPMS/*.rpm /tftpboot/rpm

Once all of the RPMs are copied, the OSCAR installation can begin. OSCAR provides a graphical installation wizard for the installation. Substitute the name of your private network Ethernet adapter; ours was eth1:

cd $OSCAR_HOME ./install_cluster eth1

After a few moments, the OSCAR installation wizard begins to load. This wizard provides a graphical user interface and an intuitive eight-step process to complete the cluster setup (Figure 2). The only variation in our installation procedure was to set the default MPI implementation to LAM/MPI instead of MPICH. We chose LAM because it is needed for mpiBLAST to execute properly.

Figure 2. The OSCAR installation wizard lets you deploy, configure and test cluster software.

Clicking on step 2, Configure Selected OSCAR Packages, displays a small dialog (Figure 3). From there you can select the Environment Switcher button and choose LAM as the default for the installation (Figure 4).

Figure 3. Click on Config... to change environment to LAM.

Figure 4. Select LAM for default environment.

We followed the remaining steps as described in the OSCAR documentation to build and install a disk image for the nodes. Once all of the nodes were installed and tested, we downloaded and installed mpiBLAST.

______________________

White Paper
Linux Management with Red Hat Satellite: Measuring Business Impact and ROI

Linux has become a key foundation for supporting today's rapidly growing IT environments. Linux is being used to deploy business applications and databases, trading on its reputation as a low-cost operating environment. For many IT organizations, Linux is a mainstay for deploying Web servers and has evolved from handling basic file, print, and utility workloads to running mission-critical applications and databases, physically, virtually, and in the cloud. As Linux grows in importance in terms of value to the business, managing Linux environments to high standards of service quality — availability, security, and performance — becomes an essential requirement for business success.

Learn More

Sponsored by Red Hat

White Paper
Private PaaS for the Agile Enterprise

If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.

Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.

Learn More

Sponsored by ActiveState