Improving Application Performance on HPC Systems with Process Synchronization
One would expect that doubling the processing power available to an application would double the application performance or cut the run time in half. Unfortunately, HPC users know this is far from true, with actual performance efficiency dropping to as low as only 5% of a system's theoretical peak performance. HPC researchers and application developers have spent and continue to spend much effort trying to find the source of these performance losses and boost sustained application performance. When we set about developing the Cray XD1 system, we joined the ranks of researchers attacking this problem. This article describes how we learned from those who went before us and how we built on that knowledge to develop a new Linux scheduling-based solution that promises to improve real application performance significantly on Linux HPC systems.
The majority of research has focused on the structure of the HPC applications themselves. Various research teams attempted to improve the efficacy of caching, looked for ways to minimize interprocessor communications and explored a variety of similar measures, but each strategy offered performance gains of only a few percent. Another area of research has shown particular promise, however. By understanding the interaction between the HPC application and the system background processes, one can find ways to modify this interaction to improve performance.
In a seminal paper documenting their research into this interaction, researchers Petrini, Kerbyson and Paking of the Los Alamos National Labs (see the on-line Resources) quantified the loss in application performance caused by what they deemed “noise”—the interaction between large multiprocess MPI jobs and background processes. They observed that housekeeping tasks, or noise, delayed individual processors from reaching MPI barriers (synchronization points in the application) and caused all other processors to wait while one processor finished up its housekeeping. This resulted in wasted cycles on all other processors.
The top half of Figure 1 illustrates this interaction and how it results in lost performance. The processes illustrated are part of a parallel job, each running on a separate processor and periodically synchronizing through the use of MPI barriers. In the first part of the computation, Process 1 is delayed because the node's scheduler pauses process execution to run background processes, such as those found on every Linux or UNIX node. Processes 2 and 3 also are delayed. Repetition of this pattern results in substantially reduced sustained application performance. The magnitude of the impact is a function of the frequency with which barriers are encountered and the number of processors.
Petrini and colleagues quantified this loss of performance running SAGE, a compressible Eulerian hydrodynamics code, on their HPC system, named ASCI Q. ASCI Q is a cluster of 2,048 HP ES45 nodes, where each node is a four-way SMP. Petrini, et al., observed that better performance was obtained when they restricted SAGE to run on only three of the four processors in the SMP when more than 256 nodes were utilized. They theorized that this result was caused by background noise, and the theory was verified by eliminating many of the sources of noise and observing the improvement in performance.
This research points to lack of process synchronization and wait time as the culprit that is robbing fine-grained and highly parallel HPC applications of up to 50% (and perhaps more) of their potential performance. Unfortunately, a means to stop this thievery still was not at hand. The method employed by Petrini, et al., to identify the culprit—restricting the system's freedom to run housekeeping tasks—doesn't present a practical solution for most HPC applications. The prospect of relegating one-quarter of the processors on an HPC system to running housekeeping tasks is not palatable to many HPC sites. In addition, many background processes cannot be removed, limiting the performance gain achievable using this approach.
When we set out to build a new high-performance computer, we also set out to find a way to prevent this performance theft. We considered a new approach using the Linux scheduler to synchronize scheduling of MPI jobs and housekeeping tasks. Previous work and our research suggest that this new synchronized scheduling approach can deliver a 50% or greater performance boost to many fine-grained parallel applications running on 32 or more processors.
Our approach was to create a new Linux scheduling policy. To achieve the desired gains, this policy must synchronize the schedulers on all nodes in a Linux HPC system to ensure that MPI processes run concurrently on all processes and that Linux housekeeping processes execute at the same time. Thus, the scheduler must have a means to achieve global synchronization, as illustrated in Figure 1. To achieve global synchronization, we designed a feature in the communications processor to synchronize the clock in each processing node.
The new Linux scheduler policy defines a scheduling frame of 128 time slots, 120 of them reserved for application execution and eight for housekeeping processes. Schedulers on different processors are able to align their scheduling frames by exploiting a globally synchronized clock, which guarantees sub-microsecond variation in time between nodes in the system. At any moment in time, all processors either are executing the HPC application or running housekeeping processes (bottom half of Figure 1).
This approach to process synchronization is scalable to high processor counts, because scheduling decisions are made locally on each node. This provides a significant boost to sustained application performance by eliminating wasted CPU cycles caused by waiting at barriers.
The synchronized scheduler is implemented as a new policy augmenting the three existing policies in the scheduler associated with the Linux kernel. The Linux scheduler is invoked when the process being executed is blocked or voluntarily gives up the CPU, when the processor receives an interrupt or at the end of a 10-millisecond timeslice. The scheduler selects the next process to run based on the scheduling policy applicable to that process and its priority. With the new synchronization policy in place, Linux then selects from one of the following scheduling policies, two for real-time processing and two for conventional time-sharing processes, listed in order of decreasing precedence:
FIFO (first in, first out): a process marked FIFO runs until it relinquishes control of the CPU. This priority is used for short duration, real-time system processes. FIFO processes run ahead of others.
Round-robin: a process using this policy receives a 10-millisecond timeslice, in turn. It is available for real-time processing.
Synchronized: we added the synchronized policy to enable synchronized scheduling of processes in a multiprocessor batch job. The workload management system marks each process as using this policy when it is started. These processes and their offspring gain the benefits of synchronized scheduling.
Priority: priority scheduling is the familiar time-sharing mechanism known to Linux users. Processes using this scheduling policy have priorities associated with them and receive time proportional to their priority. All user processes and virtually all system processes run under this policy. The scheduler selects the next process to run from the policy class with the greatest precedence. FIFO and round-robin system processes run first. Processes marked for synchronized scheduling run before processes using the normal priority scheduler.
The new synchronized scheduling policy creates a scheduling frame that dictates when batch jobs and other user and system processes are executed. The frame includes a predefined number of time slots that are cycled through in sequence. A time slot represents 10 milliseconds (one system timer tick in Linux), during which the process assigned to the time slot is executed. The current implementation has 128 time slots, 120 for the execution of batch jobs and eight for other processes. During the latter time slots, the synchronized scheduling policy indicates there are no runnable batch processes, and the conventional priority scheduling policy takes over for all other housekeeping processes. When no batch jobs exist, the behavior of the Cray scheduler is indistinguishable from the conventional Linux scheduler.
The number of time slots in a scheduling frame is configurable, but it must be a power of two. The ratio of time slots reserved for batch processing versus other processes also may be adjusted. Figure 2 illustrates a typical scheduling frame, with the locations of batch time slots shown in red and housekeeping time slots in grey.
A scheduling frame is created when the first batch process is started on a node. All batch time slots are assigned to that process. The creation of additional batch processes results in an even distribution of time slots across processes. If n batch processes are created, the first batch process receives the first 120/n time slots, the second receives the next 120/n time slots and so forth. The synchronized scheduler thus is able to support batch jobs that require multiple processes on each CPU.
A batch process executes to the end of its allotted time, as long as it makes no blocking or CPU-yielding system calls. If the batch process yields the CPU, perhaps as a result of making a blocking system call, another batch process is scheduled to run. If there are no runnable batch processes, control passes to the conventional priority scheduler to run housekeeping processes. Of course, batch processes regain the CPU if they are unblocked by the handling of an interrupt.