Improving Application Performance on HPC Systems with Process Synchronization

Hey, cluster node, don't work on that housekeeping task right now—we're all waiting for you to finish your part of the MPI job! Here's a scheduling policy designed for cluster efficiency.
Alignment of Scheduling Frames between Processors

So far, we have discussed only scheduling of batch jobs and system processes within a single node. However, to stop the performance thievery, this synchronized scheduler must work across all processors. Here, we encounter a critical system design criteria that makes this synchronized scheduler approach possible—the availability of global time synchronization. In our design, global time synchronization is carried out by communications processors designed within the HPC system. These processors offload communications processing from the application processors. Communications processors also run a time synchronization protocol to achieve global clock synchronization. Tight time synchronization can be achieved because the communications processors have control over packet scheduling and jitter—the difference in time between any pair of processors is less than 1 microsecond. A further advantage of delegating time synchronization to the communications processors is this load is removed from the processors carrying the application workload, leaving more time for application processing and further reducing interrupts to the application processors.

The time synchronization protocol includes additional fields for time slot alignment. The protocol uses a master-slave paradigm, where one node acts as the source of the time and time slot information and all other nodes in the system synchronize themselves to the master node's clock. The time synchronization packets received from the master identify the time slot being executed and the time elapsed since the start of the time slot, enabling precise alignment of scheduling frames across the entire HPC system.

Performance Implications

This synchronized scheduler delivers synchronized execution of the processes in a parallel application. How much performance degradation can be avoided or how much potential performance can be gained is a function of how frequently the application uses barriers and/or collective operations, how much time is taken by system housekeeping processes and the number of processors employed by the application.

Our research indicates significant speedup can be achieved. Figures 3 and 4 show the theoretical speedup that can be achieved through the use of the synchronized scheduler, relative to the conventional priority scheduler. Figure 3 assumes that background processing requires 1.5% of the CPU, and Figure 4 assumes that 6.25% of the CPU is consumed by background processing—this is a realistic metric on most clusters. Curves are shown for applications encountering an average of 100, 200 and 300 barriers per second.

Figure 3. Theoretical Speedups with Process Synchronization with 1.5% Dæmon CPU Utilization

Figure 4. Theoretical Speedups with Process Synchronization with 6.25% Dæmon CPU Utilization

As the number of processors increases, the performance gain from the synchronized scheduler increases and asymptotically approaches a maximum value. This reflects the fact that performance doesn't continue to degrade with the conventional scheduler. After a certain processor count is reached, the probability of at least one processor being delayed by housekeeping increases to 100%. The addition of more processors does not significantly add to the application delay encountered at barriers.


By focusing on the interactions between the HPC application and the system background processes, HPC researchers identified a major culprit for performance losses in parallel applications. Additional research identified ways of preventing this thievery, but none to date have provided successful, real-life implementations. Global process synchronization using the Linux scheduler eliminates wait time due to noise and promises significant performance gains. By looking beyond the application and into the role of the rest of the HPC system, we believe we have found a scalable, real-life implementation. With Linux process synchronization using a global clock synchronization and Linux running on each processing node, the Cray implementation ensures application processes run concurrently on all processors and housekeeping is performed concurrently on all processors and bounded in time. Our process synchronization solution can prevent performance theft and increase application performance for fine-grained highly parallel applications running on 32 processors or more by up to 50%.

Resources for this article: /article/7756.

Dr Paul Terry is the Chief Technology Officer for Cray Canada, Inc., previously OctigaBay Systems, which was acquired by Cray in April 2004. He is a technology strategist for innovative computing architectures and is responsible for establishing the company's technology vision and leadership.

Amar Shan, Director of Product Management, Cray, Inc., is responsible for introducing Cray's leading-edge technical innovations and creative business solutions into the marketplace. He has more than 20 years' experience in the computing and telecommunications industries in product management, development and architecture roles.

Pentti Huttunen, Benchmarking Specialist at Cray, Inc., is responsible for researching parallel computing technologies and optimizing applications to ensure that they are running efficiently on a variety of platforms at Cray, Inc.



Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Is the LSS source code

Anonymous's picture

Is the LSS source code available or in use in any standard Linux distributions?