A Process for Managing and Customizing HPC Operating Systems

For the past ten years, high-performance computing (HPC) has been dominated by clusters of thousands of Linux servers connected by a uniform networking infrastructure. The defining theme of an HPC cluster is uniformity. This uniformity is most important at the application level: communication between all systems in the cluster must be the same, the hardware must be the same, and the operating system must be the same. Any difference in these features must be presented as a choice to the user. The uniformity and consistency of running software on an HPC cluster is of utmost importance and separates HPC clusters from other Linux clusters.

The uniformity also must persist over time. Upgrades and security fixes should never affect application correctness or performance. However, security concerns in HPC environments require updates to be applied in a timely fashion. These two requirements conflict and need to be managed by well-documented processes that involve testing and regular outages.

A process for managing these requirements was developed at the Environmental Molecular Sciences Laboratory (EMSL) during the past ten years. EMSL supports HPC for the United States Department of Energy (DOE) and the open science community. This process gives EMSL an edge in maintaining a secure platform for large computational chemistry simulations that complement instrument research done at EMSL.


The process developed at EMSL to maintain HPC clusters has roots in standard software testing models. The process involves three phases: build testing, integration testing and production. Each phase has its own requirements in hardware, software and organization. Other important systems include configuration management, continuous monitoring and repository management. All of these systems have well-defined roles to play in the overall process and need dedicated hardware, not part of the production cluster, to support them.

The build integration phase requires two primary components: package repository management and continuous integration software. These two components interact and give software developers and system administrators knowledge of bugs in individual pieces of software before those updates affect integration testing. This form of testing is important to automate for critical applications because it helps facilitate communication between operations and development teams.

The integration testing phase requires a test cluster that matches the production cluster as closely as possible. The primary difference between the production and test clusters, for HPC, is scale. The test cluster should have fewer hosts than production, but at least one of every type of Linux host in the production cluster, including the configuration management and continuous monitoring hosts. Furthermore, the Linux hosts should match the production configuration as closely as possible. Any deviations between the production and test clusters' configurations, in both hardware and software, should be well documented. This documentation will help define the accepted technical risks that might be encountered during production outages.
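The two checks described above, coverage of every host type and a record of configuration deviations, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hostnames, role labels and configuration keys are hypothetical, and a real site would pull them from its own inventory or configuration management system.

```python
def missing_roles(production_hosts, test_hosts):
    """Return production roles that have no host in the test cluster."""
    test_roles = set(test_hosts.values())
    return sorted(set(production_hosts.values()) - test_roles)


def config_deviations(production_cfg, test_cfg):
    """Map each differing setting to its (production, test) values."""
    keys = sorted(set(production_cfg) | set(test_cfg))
    return {
        key: (production_cfg.get(key), test_cfg.get(key))
        for key in keys
        if production_cfg.get(key) != test_cfg.get(key)
    }


# Hypothetical inventories: hostname -> role.
production = {
    "login1": "login",
    "login2": "login",
    "cn0001": "compute",
    "cn0002": "compute",
    "cfg1": "config-mgmt",
    "mon1": "monitoring",
}
test = {"tlogin1": "login", "tcn01": "compute", "tcfg1": "config-mgmt"}

# Roles the test cluster still lacks.
print(missing_roles(production, test))

# Deviations to record in the risk documentation (hypothetical settings).
print(config_deviations(
    {"kernel": "A", "interconnect": "infiniband"},
    {"kernel": "A", "interconnect": "ethernet"},
))
```

An empty list from the first check means every production host type is represented in the test cluster. The output of the second check is the raw material for the deviation documentation.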

The production cluster is the culmination of all the preparations done in the build and testing phases. Leading up to an outage, the tasks to be performed during the outage should be identified and documented along with the planned operating system upgrades. These documents should be stored where developers and management can easily see them and where operational staff can easily modify them and track issues. Along with the plan, documented processes for moving configuration management and continuous monitoring from testing to production also should be followed.

We have identified some required infrastructure needed to support and automate the process for managing your own Linux HPC operating system. During the build integration phase, a dedicated build system is needed along with package management and continuous integration software. The integration testing phase requires test cluster hardware and continuous monitoring and configuration management software. Finally, the production cluster also should integrate with configuration management and continuous monitoring software.

Several systems are not covered here but are critical to integrate into the process. Site-specific backup solutions should be considered for every phase of the process. Furthermore, automated provisioning systems also should be considered for use with this process. At EMSL we have used both, but they are certainly not required by the process; they just make it easier to sleep at night.

Build Phase

The build phase is the start of the process. There are three inputs into the system: binary packages, source code packages and tickets. These three inputs produce three outputs: a set of base repositories, a set of patches for upstream contribution and an overlay repository of modified packages. These inputs and outputs provide the operating system fixes needed for your site while contributing them back to the communities that support them. To understand this process completely, let me break down the components and talk about their requirements.

The package repository management system is used throughout the process but first appears in the build phase. The package repository management system should be able to download binary package repositories from an upstream distribution. It also should be able to keep those downloaded repositories in sync with the upstream distribution. The first set of repositories should be a local copy of the upstream distribution, including updates, synchronized daily. As an added feature, the package repository management system also should be able to exclude certain packages selectively from being downloaded. This feature complements the contents of the overlay repository. The overlay repository is where custom builds of packages are placed to enhance the base distribution.
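Selective exclusion can be pictured as a filter applied to the upstream package list before the daily sync. The following is a minimal sketch, assuming packages are identified by name and that exclusions are written as shell-style patterns; the function name, package names and patterns are hypothetical and do not describe any particular repository management tool.

```python
from fnmatch import fnmatchcase


def select_for_mirror(upstream_packages, exclude_patterns):
    """Return the upstream package names that should be mirrored locally.

    A package is skipped when its name matches any exclusion pattern,
    on the assumption that the overlay repository supplies it instead.
    """
    return [
        name
        for name in upstream_packages
        if not any(fnmatchcase(name, pattern) for pattern in exclude_patterns)
    ]


# Hypothetical upstream listing and site exclusions.
upstream = ["bash", "kernel", "kernel-devel", "openmpi", "openmpi-devel", "zlib"]
excluded = ["kernel*", "openmpi*"]

print(select_for_mirror(upstream, excluded))
```

Patterns are used rather than exact names so that related subpackages are excluded together with the main package.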

The content of the overlay repository is specific to the critical applications in the distribution that need to be managed separately. For example, HPC sites might be more concerned about the kernel build, the OpenFabrics Enterprise Distribution (OFED) and software that implements the Message Passing Interface (MPI). This software is then removed from the base distribution and added back in an overlay repository. Furthermore, there can be multiple overlay repositories. For example, security concerns may dictate that the kernel needs to be managed separately from the rest of the distribution. Having the kernel in a separate overlay repository means that a kernel update can skip the testing phase with minimal impact, so the cluster stays secure.
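One way to reason about multiple overlay repositories is to compute which repository ends up providing each package. The sketch below assumes a simple rule in which later overlays take precedence over earlier ones and over the base; the repository names, package names and version strings are all hypothetical, and real package managers resolve versions and priorities with their own rules.

```python
def effective_packages(base, overlays):
    """Return a mapping of package name -> (providing repository, version).

    `base` maps package names to versions. `overlays` is a list of
    (repository name, packages) pairs; later entries override earlier ones.
    """
    result = {name: ("base", version) for name, version in base.items()}
    for repo_name, packages in overlays:
        for name, version in packages.items():
            result[name] = (repo_name, version)
    return result


# Hypothetical repositories. The kernel lives in its own overlay so it can
# be updated on a different schedule from the rest of the site packages.
base_repo = {"bash": "1.0", "zlib": "1.0"}
site_overlays = [
    ("overlay-hpc", {"openmpi": "1.0-site1", "ofed": "1.0-site1"}),
    ("overlay-kernel", {"kernel": "1.0-site2"}),
]

for name, (repo, version) in sorted(effective_packages(base_repo, site_overlays).items()):
    print(name, repo, version)
```

Keeping the kernel in a separate entry makes it possible to update that one overlay without touching the others.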

The packages in the overlay repository are patched to match the needs of the organization. The continuous integration system should be used to patch the specific packages and maintain the build with future updates. These patches should be submitted back to the upstream distribution along with a clear explanation of why each patch was needed. Some of these patches may be accepted by upstream developers and make it into the distribution, while others may take years to be accepted because of policy decisions on the part of the distribution maintainers.

Another job of the continuous integration system is to support the continuous build and testing of additions to the distribution that are not supported. These additions may be site-specific applications or open-source software not supported by the distribution. Many open-source software projects support compatibility with enterprise distributions but do not seek inclusion in the distribution for reasons related to how the project is funded.

The final piece of the build phase is the ticket-tracking system. This system gives package developers insight into what needs to be fixed and how. These tickets may come directly from users or from cluster administrators. Furthermore, the users and cluster administrators may use completely different ticketing systems. This piece of the process helps facilitate communication between groups. Having a list of tickets allows objective discussion about priority and makes sure tickets are not forgotten. Tickets may stay open for days or years, depending on priority and rate of ticket creation. The tickets do not stop with the package developers; the cluster administrators use this system in further phases.

The package management and continuous integration systems are automated processes, while the ticket-tracking system requires human interaction. These systems can be deployed on a single host. However, there is a requirement that three copies of the package repositories be present for the later phases of the process. Furthermore, there are features of the continuous integration system that integrate with the ticket-tracking systems. Enabling this feature does require a certain level of stability in the continuous integration build process. Many of the specifics in these systems are not covered here and will be covered later.


David Brown is a high-performance computing system administrator with a B.S. in Computer Science from Washington State University.