Building a Bioinformatics Supercomputing Cluster

Bioinformatics tools running in the OSCAR cluster environment turned 17 recycled PCs into a system that improves performance for user queries.
Installing mpiBLAST for Parallel Searches

We downloaded mpiBLAST and installed it according to the documentation provided in the README file. We created symbolic links for mpiblast and mpirun in our $PATH, and no further configuration of mpiBLAST was necessary.

Once mpiBLAST was installed, we needed to download a database to search. For mpiBLAST to execute properly, the database needs to be in the FASTA format. NCBI offers an index for all of its databases on the NCBI Web site, and that index lists a FASTA subdirectory containing all of the databases in FASTA format. We downloaded a copy of the nr database to /usr/local/mpiBLAST/db/, an NFS-shared folder set up during the installation of mpiBLAST. mpiBLAST provides the mpiformatdb command, which formats the database into segments; the number of segments depends upon the number of nodes in the cluster. mpiformatdb places the segments it creates into a shared directory. This directory is defined in mpiblast.conf during installation and is utilized by all mpiBLAST programs. Here is an example of formatting the database:

# /usr/local/mpiBLAST/bin/mpiformatdb -N 16 -i nr

Here, -N specifies the number of database segments—usually the number of nodes in the cluster—and -i specifies the name of the database file to format. In this example, the nr database is formatted into 16 individual segments. mpiformatdb does not copy the segments to the nodes, so a significant amount of overhead is incurred while each node copies its database segment during the first query. Each node copies a segment only once. If the segment is erased from the node, it is copied again during the next query.

To simplify management of the cluster, we wrote a script to download the newest version of a database, format it with mpiformatdb and distribute it to the nodes by executing a simple BLAST query. We scheduled this script with cron to run on a weekly basis. Once we were able to execute BLAST queries in parallel, we added the Web-based front end from WWW BLAST.


mpiBLAST provides command-line BLAST searches and includes two files for interaction with a Web-based front end, blast.cgi and These files are configured to work with WWW BLAST. So our next step was to download WWW BLAST into the /var/www directory, creating the /var/www/blast/ directory. Several configuration changes had to take place for WWW BLAST to submit BLAST searches for parallel execution.

WWW BLAST provides its own directory for databases. Because we are using mpiBLAST to format the databases, we had to point WWW BLAST's db/ directory to mpiBLAST's. We then made the db/ directory in blast/ a symbolic link to the db/ directory for mpiBLAST.

WWW BLAST provides a file called blast.cgi that executes a BLAST query. mpiBLAST provides a replacement blast.cgi that executes a parallel BLAST query by way of is a Perl script that creates a query for mpiBLAST to execute. creates this query in the form of another Perl script, populating it with the parameters from the Web form. This script is submitted to OpenPBS. serves several functions, including parsing the parameters of the form, creating a script to be submitted to the cluster through OpenPBS for job queuing and load balancing and returning the BLAST search results in a browser-friendly format.

We needed to make some changes to, however, to allow it to operate correctly in our environment. The first change that we made was to the global variables $scratch_space and $MPIBLASTCONF. These two variables are used throughout the life of the script. $scratch_space holds the absolute path to a directory containing temporary files used during a query. $MPIBLASTCONF holds the absolute path to the directory containing the mpiBLAST configuration file. Both of these directories were set up during the installation of mpiBLAST. We set the two variables as follows:


The next change involved changes to a series of if statements. These statements hard-code the NUMPROC environment variables for the nt, nr and pdb databases. Because the databases need to be preformatted by mpiBLAST, the number of processors used per query is constant. We changed the default number of 20 to 16, which is the number of processors we use:

if($data{'DATALIB'} eq "nt"){
    $data{'NUMPROC'} = 16;

Further down in the script, the ValidateFormData subroutine is defined. This subroutine ensures that the user has selected a valid database/program combination and produces a 500 server error if a valid combination is not selected. We changed the subroutine to allow the tblastx program to execute queries on the nr database by making the following change:

#### BEFORE ####
# Must be applied to a nucleotide database
if($data_ref->{'DATALIB'} ne "nt"){

#### AFTER ####
# Must be applied to a nucleotide database
if($data_ref->{'DATALIB'} ne "nt" ||
   $data_ref->{'DATALIB'} ne "nr"){

Later on, the script creates a string of command-line arguments for mpiBLAST and stores them in the variable $c_line. We needed to change the value passed to the -d option, which tells mpiBLAST which database to search. By default, concatenates the number of processors to the database name and passes the result to the -d option. So if our database was named nr and we had 16 processors, it would pass nr16. Presumably this is done to allow more than one version of a database to be searched, that is, nr16 for a 16-segment database and nr8 for an 8-segment database. You either can name your databases in that manner or modify the script. Because we only ever have one version of a database, we chose to modify the script, removing the number of processors from the database name. The code changes are summarized below:

#### BEFORE ####
# Create the command line to pass to mpiBlast my
$c_line = "-d $data_ref->{'DATALIB'}" .
          "$data_ref->{'NUMPROC'} " .
          "-p $data_ref->{'PROGRAM'} " .

#### AFTER ####
# Create the command line to pass to mpiBlast my
$c_line = "-d $data_ref->{'DATALIB'} " .
          "-p $data_ref->{'PROGRAM'} " .

When running test queries, we received several lcl|tmpseq_0: Unable to open BLOSUM62 warnings in the OpenPBS error log. Pointing the environment variable BLASTMAT to the location of the BLAST matrices clears up these warnings, so we made the following change:

#### BEFORE ####
print SCRIPTFILE '#PBS -e '.
print SCRIPTFILE 'if(-e $ENV{PBS_NODEFILE} ){'."\n";

#### AFTER ####
print SCRIPTFILE '#PBS -e '.
print SCRIPTFILE 'if(-e $ENV{PBS_NODEFILE} ){'."\n";

We encountered the final alteration toward the end of the script in the HtmlResults subroutine. The code that directs the user to the results uses a default base URL, which almost certainly is not what you want. Changing the base URL to point to our Web server allowed the client's Web browser to display the results of the BLAST query:

#### BEFORE ####
print "Location:".

#### AFTER ####
print "Location: http://domain_name/BlastResults".


Geek Guide
The DevOps Toolbox

Tools and Technologies for Scale and Reliability
by Linux Journal Editor Bill Childers

Get your free copy today

Sponsored by IBM

Upcoming Webinar
8 Signs You're Beyond Cron

Scheduling Crontabs With an Enterprise Scheduler
11am CDT, April 29th
Moderated by Linux Journal Contributor Mike Diehl

Sign up now

Sponsored by Skybot