A Process for Managing and Customizing HPC Operating Systems

Test Phase

The integration testing phase requires the package repository management, continuous monitoring and configuration management systems. These three systems help maintain the test cluster in a state in which integration testing can be done by automated processes. Furthermore, the test cluster's hardware configuration should represent all critical aspects of the production cluster, so that testing on it mitigates risk to the production cluster.

The package repository management system plays a role in all three phases of the process. This is the first phase in which the packages, with your additions, are tested in a production-like configuration. The daily repositories, including the overlay repository, are synchronized to a set of testing repositories to be included in the test cluster. This synchronization ensures a consistent environment in which to perform tests.
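The synchronization step can be pictured as an exact mirror of one directory tree into another. The following is only a minimal sketch in Python, assuming each repository is a plain directory of package files; the function names are hypothetical, and a real site would more likely rely on its repository management tool or a utility such as rsync.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_files(root: Path) -> set:
    """Return every file under root as a path relative to root."""
    return {p.relative_to(root) for p in root.rglob("*") if p.is_file()}


def sync_repo(source: Path, target: Path) -> dict:
    """Mirror source into target (hypothetical helper) and report changes."""
    target.mkdir(parents=True, exist_ok=True)
    source_files = list_files(source)
    target_files = list_files(target)
    report = {"copied": [], "removed": []}

    # Copy packages that are new or whose contents differ.
    for rel in sorted(source_files):
        src, dst = source / rel, target / rel
        if rel not in target_files or file_digest(src) != file_digest(dst):
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            report["copied"].append(rel.as_posix())

    # Remove packages that no longer exist in the source, so the
    # testing repository matches the daily repository exactly.
    for rel in sorted(target_files - source_files):
        (target / rel).unlink()
        report["removed"].append(rel.as_posix())

    return report
```

A call such as sync_repo(Path("/srv/repos/daily"), Path("/srv/repos/testing")) (both paths are made up for illustration) returns the lists of copied and removed files, which gives a record of exactly what changed before the integration tests run.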

Every time updates are synchronized to the testing repositories, a set of integration tests should be performed on the test cluster. These tests should be designed to simulate the usage of the production cluster. It's important to focus the tests on critical user-level applications and parts of the operating system you have replaced and put into the overlay repository. The continuous integration system can run these tests and alert on failures.
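As an illustration of what the continuous integration system might execute, here is a minimal sketch of a test runner that treats each integration test as a command and collects the failures. The test names and commands are placeholders invented for this example; real tests would exercise your critical user-level applications and the packages in your overlay repository.

```python
import subprocess
import sys


def run_integration_tests(tests, timeout=60):
    """Run each named command; return {name: reason} for every failure."""
    failures = {}
    for name, command in tests.items():
        try:
            result = subprocess.run(
                command, capture_output=True, text=True, timeout=timeout
            )
        except (OSError, subprocess.TimeoutExpired) as exc:
            # The command could not start or did not finish in time.
            failures[name] = f"could not complete: {exc}"
            continue
        if result.returncode != 0:
            failures[name] = (
                f"exit code {result.returncode}: {result.stderr.strip()}"
            )
    return failures


# Placeholder tests (assumptions for this sketch): one passes and one
# fails on purpose to show how a failure is reported.
TESTS = {
    "interpreter-runs": [sys.executable, "-c", "print('ok')"],
    "deliberate-failure": [sys.executable, "-c", "import sys; sys.exit(3)"],
}

failures = run_integration_tests(TESTS)
for name, reason in failures.items():
    print(f"FAILED {name}: {reason}")
```

Keeping the tests as plain commands means the same list can be run by hand on the test cluster or by the continuous integration system, which can then alert on any non-empty failure report.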

Failures in the integration tests should be reported in the ticket-tracking system. This is one of the paths to complete the circle of development. Other tickets include deployment and re-install issues. Complex internal infrastructure in the production cluster also may present upgrade issues, and those issues also should be tracked. The test cluster also should be managed by the same procedures as the production cluster. The procedures should be practiced on the test cluster to minimize tickets before updates get deployed to the production cluster. All of these tasks should be repeated until few new tickets are being added, which minimizes the risk to the production cluster.
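One way to make "repeat until few new tickets are being added" concrete is a simple rule over the number of new tickets opened in each test cycle. The sketch below is an assumption for illustration only: the function name, the threshold and the number of quiet cycles are all hypothetical and would be chosen by each site.

```python
def ready_for_production(new_tickets_per_cycle, max_new=1, quiet_cycles=3):
    """Return True when each of the last quiet_cycles test cycles
    added at most max_new new tickets."""
    if len(new_tickets_per_cycle) < quiet_cycles:
        # Not enough history yet to judge the trend.
        return False
    recent = new_tickets_per_cycle[-quiet_cycles:]
    return all(count <= max_new for count in recent)


# New tickets opened after each synchronize-and-test cycle (made-up data).
history = [9, 5, 2, 1, 0, 1]
print(ready_for_production(history))
```

With the made-up history above, the last three cycles added 1, 0 and 1 tickets, so the rule reports that the update is a candidate for the production cluster.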

The configuration management and continuous monitoring systems are set up in a similar way in the test and production environments. These two systems help protect the production state against inadvertent hardware or software changes and, thus, need to be tested when deliberate changes are made. These changes need to be integrated into the production environment easily. So, maintaining the configuration for these two systems in a source code management repository that supports branching and merging also is prudent. This allows for standard processes for making changes and pushing those changes between testing and production environments.

When the number of tickets has been reduced and it is time to push the changes to the production cluster, five things come out of this phase: an updated set of package repositories, a set of tasks that need to be done during an outage, updated procedures to be used on the production cluster, and changes that need to be merged to production for each of the configuration management and continuous monitoring systems. Both the package developers and the cluster administrators need to collaborate on the procedural changes and the outage tasks. This collaboration works well in a wiki environment that is internal to both groups. These outputs conclude the integration testing phase of the process.

Production Phase

The production phase of the process takes the results from the integration testing phase and applies them to the production cluster. This phase utilizes all of the same processes as the testing phase, with a few modifications. Furthermore, this is the phase where users get to effect change in the process. Communication between groups also becomes more formal and is carried out through software. The final outputs of this phase feed back to help complete the development and testing cycle as well. After this phase is completed, the process is finished, and the updated production environment will be maintainable.

The first part of this phase replicates what was done with the package repositories. However, this phase requires that production copies of the repositories be synchronized from the integration testing repositories. This is the final set of package repositories required by the process. Furthermore, the production configuration of the continuous monitoring and configuration management systems also should be created from the integration testing configuration of the respective systems.

Users of the production cluster get input into the process during this phase. Depending on the users' requirements, this may be a different instance of the ticket-tracking system or the same one as used by the package developers and cluster administrators. Either way, it's important to track this input so it makes it through the process without getting forgotten.

Communication is key to this part of the process. From the testing phase, we know what tasks need to be completed on the production cluster during an outage and how long they should be expected to take. This helps management weigh the cost and benefit of the outage and decide on a path forward. Continued communication also is required during outages when differences between the production and test clusters bring unexpected issues. These issues should be mitigated quickly, but tickets should be issued to ensure that each issue is resolved properly and does not happen again.

It always is important to be prepared for production cluster outages. However, it is impossible to be completely prepared for every possible contingency. The differences between the test cluster and production cluster configurations help define the highest risks for any particular outage. It is critical to communicate these risks, and any changes that might be affected by them, to management prior to outages.
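Listing the differences between the two clusters can be as simple as comparing two descriptions of their configuration. The following is a minimal sketch, assuming each cluster is summarized as a dictionary; the keys and values are invented for illustration and are not taken from any real cluster.

```python
def config_differences(test_cluster, production_cluster):
    """Return {key: (test_value, production_value)} for every setting
    that differs or is present on only one cluster."""
    differences = {}
    for key in sorted(set(test_cluster) | set(production_cluster)):
        test_value = test_cluster.get(key, "<absent>")
        prod_value = production_cluster.get(key, "<absent>")
        if test_value != prod_value:
            differences[key] = (test_value, prod_value)
    return differences


# Hypothetical cluster summaries (assumptions for this sketch).
test_cluster = {
    "compute_nodes": 16,
    "interconnect": "InfiniBand",
    "kernel": "example-kernel-2",
}
production_cluster = {
    "compute_nodes": 2048,
    "interconnect": "InfiniBand",
    "kernel": "example-kernel-2",
    "parallel_filesystem": "present",
}

for key, (test_value, prod_value) in config_differences(
    test_cluster, production_cluster
).items():
    print(f"{key}: test={test_value} production={prod_value}")
```

Each reported difference is something the test cluster could not exercise, so the resulting list is a reasonable starting point for the risks to communicate to management before the outage.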

Table 1. Components and Examples of Open-Source Software That Would Meet the Requirements within the Process

Component                       Open-Source Software
------------------------------  ---------------------------------------
Continuous Monitoring           Nagios, Simple Event Correlator, Auditd
Package Repository Management   Cobbler
Continuous Integration          Jenkins, Hudson
Ticket Tracking                 Trac, Bugzilla
Wiki Documentation              Trac, Drupal, WordPress
Provisioning                    Cobbler
Configuration Management        Cfengine, Chef, Salt, Puppet
Backup Software                 Bacula

Conclusion

The process described here may seem like a lot of overhead, and it may not seem applicable to your situation. However, there are specific circumstances in which the testing phase can be skipped. Furthermore, this process is generic enough to be scaled to your needs. There are many pieces of open-source or proprietary software that can meet the requirements of this process.

The testing phase can be skipped easily by pulling the critical applications into separate overlay repositories so they can be managed separately. Then make sure the process for getting updates is put into the continuous integration system. This may just be a Web site download that pulls the appropriate software into its respective overlay repository. Then simply synchronize the overlay repository to test and then to production. This can be done immediately, which is how security updates are pushed to production systems.

Something similar applies to production configuration changes: many times, unexpected issues during an outage demand that configuration changes be made directly to production systems. It should be possible to make these changes directly in the production configuration management system and then merge them back to test when the outage is over. If the changes to production need further development to be made more generic, this should happen in the build and integration testing phases. The final solution then should be pushed to production during an outage.

In conclusion, the process described here is only a suggestion. If the process needs to be modified to get things working again, do so. However, after fixing issues, keep in mind the part of the process the issue relates to, and feed what has been done back into the process. This process is generic and flexible enough to manage these sorts of changes as well as keep systems updated while managing communication through well-defined systems.

______________________

David Brown is a high-performance computing system administrator with a B.S. in Computer Science from Washington State University.
