OSCAR and Bioinformatics

With new cluster deployment and management tools, you can set up a 64-node cluster in an hour, then put it to work on your research.
New Features in OSCAR 4.0

We are planning to include various features in this upcoming release. They are partitioned into four main categories: NEST, node groups, Linux distributions and SIS.

NEST (Node Event and Synchronization Tools) is used to ensure that OSCAR packages are in sync with their centrally stored configurations across all the cluster nodes. Currently, when you install a new cluster node, post_install scripts for OSCAR and all packages have to be run on all the cluster nodes regardless of whether they need to. Although this model has worked on a medium scale, clearly a scalability limitation is imposed by it. The biggest change with NEST is that package configuration will be pulled from the server as opposed to being pushed to the clients. Operations will be executed only if they are necessary, which is a more elegant way than the brute-force execution scheme we are employing presently.

Node groups are arbitrary groupings of nodes in the cluster. With this new feature, it is possible to install and manage OSCAR packages selectively for these groups. In the upcoming release, we plan to support only the server and client node groups. In the future, however, users will be allowed to define their own groups.

One of the key features that defines OSCAR is our support for several Linux distributions. With the new release we hope to introduce support for Fedora Core 2 and 3, Red Hat Enterprise Server 3.0 and Mandrake 10. Supporting these distributions will also introduce support for IA-64 and x86-64 architectures.

System Installation Suite (SIS), which includes SystemImager, is the collection of programs that performs the image deployment of an OSCAR system. There are two main SIS-related improvements. First is the disk type autodetection. Traditionally, OSCAR cluster images are created with one type of hard disk in mind (either IDE or SCSI). With this OSCAR-specific patch, you can use the same image to deploy on different machines with different types of hard disks, as long as the base hardware is similar.

Second, a tool is available so users can use specific kernel modules for booting the nodes for imaging. Sometimes it is difficult to get newer hardware to work with OSCAR, because the SIS kernel boot image does not have the supported drivers. With this tool, you can use an existing kernel with known working modules as the SIS boot kernel and use that to boot up your client nodes so they can be imaged. This feature will be included in the next SystemImager development release, which we hope to include in OSCAR 4.0.

Creating Packages for OSCAR

OSCAR supplies packages for commonly used cluster-aware applications. They are simply RPM packages with corresponding metafiles and installation scripts. These packages are created and maintained by the OSCAR core team and package authors. If there is an application you would like to install on your cluster, but did not find it available from OPD (OSCAR Package Downloader), please create a package for it. The OSCAR team is open to contributions and possibly even hosting an OSCAR software package you might create. The team extends this invitation to software developers too.

OSCAR packages reside in package repositories, which are decentralized Web spaces provided by the package authors to host the package files. The URIs of these repositories are stored in a master repository list.

Creating OSCAR packages is generally a straightforward process. If an RPM is readily available, you are already halfway done. What remains is to create some files to store metadata about the package/RPMs and also scripts to propagate configuration files for the entire cluster. This is relatively easy to do because certain assumptions can be made about an OSCAR cluster.

If, however, an RPM package is not currently available, you need to package the application in RPM format before continuing. Creating an RPM may or may not be easy, depending on the complexity of the application. You need to create a spec file and build the RPM(s) and corresponding SRPM with the source tarball.

The first OSCAR package that the Genome Sciences Centre (GSC) has put out is for Ganglia, a cluster monitoring system. We have started working on our second package, which is for Sun Grid Engine, an open-source batch scheduling system sponsored by Sun Microsystems. This should be available from OPD at the GSC OSCAR Repository at a later date.

Bioinformatics Applications and OSCAR Cluster

The most commonly used bioinformatics program is BLAST, a sequence alignment/search tool. Querying genetic databases with multiple gene sequences is highly amenable to parallelization; it is very easy to split this problem into many sub-jobs, with each sub-job querying the database with one set of input. Solutions exist that perform such database/query splitting for you, most notably the open-source mpiBLAST as well as the commercial version from Paracel. Parallelized versions of BLAST scale quite well but, of course, reach a point where adding more nodes simply does not increase performance due to the overhead of starting separate and smaller jobs on multiple nodes.

FPC is another application we use a great deal and is used for assembling, editing and viewing fingerprint-based physical maps. The original parallelized version of this application was developed at the GSC, but it was not batch system-aware. We recently have integrated parallel FPC with Sun Grid Engine so that users can easily request a specific number of nodes to run this application.

We also are developing a peer-to-peer application called Chinook for sharing bioinformatics services. Currently, it is still under development, and we are working on integrating it with our cluster. This potentially could link up grids in the future and provide an alternative to the Globus toolkit.

Currently, our 200-node cluster is being heavily used to discover and classify regulatory elements in the human genome. The human genome consists of roughly 30,000 genes, and we are employing different algorithms to scan each gene. Genes are grouped into batches of 1,500, and a typical run would take about four days on an Opteron 2.0GHz machine. Without cluster technology, this kind of research would not be possible.