Improving Application Performance on HPC Systems with Process Synchronization
So far, we have discussed only scheduling of batch jobs and system processes within a single node. However, to stop the performance thievery, this synchronized scheduler must work across all processors. Here, we encounter a critical system design criterion that makes the synchronized scheduler approach possible: the availability of global time synchronization. In our design, global time synchronization is carried out by communications processors designed into the HPC system. These processors offload communications processing from the application processors and also run a time synchronization protocol to achieve global clock synchronization. Tight time synchronization can be achieved because the communications processors have control over packet scheduling and jitter; the difference in time between any pair of processors is less than 1 microsecond. A further advantage of delegating time synchronization to the communications processors is that this load is removed from the processors carrying the application workload, leaving more time for application processing and further reducing interrupts to the application processors.
The time synchronization protocol includes additional fields for time slot alignment. The protocol uses a master-slave paradigm, where one node acts as the source of the time and time slot information and all other nodes in the system synchronize themselves to the master node's clock. The time synchronization packets received from the master identify the time slot being executed and the time elapsed since the start of the time slot, enabling precise alignment of scheduling frames across the entire HPC system.
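As a rough illustration of how a node might use these fields, the following Python sketch computes when the next time slot begins on the local clock. It is not the Cray implementation: the packet layout, the field names, the fixed slot length and the `align_to_master` helper are all assumptions made for the example, and network transit delay is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPacket:
    """Hypothetical contents of one time synchronization packet."""
    master_time_us: int      # master clock reading when the packet was sent
    slot_index: int          # time slot the master is currently executing
    elapsed_in_slot_us: int  # time elapsed since the start of that slot


def align_to_master(packet, local_time_us, slot_length_us, slots_per_frame):
    """Return (clock_offset_us, next_slot_index, next_slot_start_local_us).

    Assumes the packet is received instantly (no transit delay) and that
    every slot in the scheduling frame has the same length.
    """
    # How far the local clock is behind (+) or ahead of (-) the master.
    offset_us = packet.master_time_us - local_time_us

    # Start of the current slot and of the next one, in master time.
    slot_start_master = packet.master_time_us - packet.elapsed_in_slot_us
    next_start_master = slot_start_master + slot_length_us

    # Convert the next slot boundary back to the local clock.
    next_start_local = next_start_master - offset_us
    next_slot = (packet.slot_index + 1) % slots_per_frame
    return offset_us, next_slot, next_start_local


if __name__ == "__main__":
    pkt = SyncPacket(master_time_us=1_000_500, slot_index=3, elapsed_in_slot_us=500)
    print(align_to_master(pkt, local_time_us=1_000_498,
                          slot_length_us=10_000, slots_per_frame=4))
```

Because every node derives the same slot boundary from the same master packet, the nodes switch between application and housekeeping slots together, to within the accuracy of the clock synchronization.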
This synchronized scheduler delivers synchronized execution of the processes in a parallel application. How much performance degradation can be avoided, or how much potential performance can be gained, is a function of how frequently the application uses barriers and/or collective operations, how much time is taken by system housekeeping processes and the number of processors employed by the application.
Our research indicates that significant speedup can be achieved. Figures 3 and 4 show the theoretical speedup that can be achieved through the use of the synchronized scheduler, relative to the conventional priority scheduler. Figure 3 assumes that background processing requires 1.5% of the CPU, and Figure 4 assumes that 6.25% of the CPU is consumed by background processing, a realistic figure on most clusters. Curves are shown for applications encountering an average of 100, 200 and 300 barriers per second.
As the number of processors increases, the performance gain from the synchronized scheduler increases and asymptotically approaches a maximum value. This reflects the fact that performance with the conventional scheduler doesn't continue to degrade indefinitely. Once a certain processor count is reached, the probability of at least one processor being delayed by housekeeping approaches 100%, so adding more processors does not significantly add to the application delay encountered at barriers.
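The saturation effect can be seen in a toy probability model. The Python sketch below is not the model behind Figures 3 and 4; it simply assumes that, during any interval between barriers, each processor is independently delayed by housekeeping with probability p, and it borrows the two background-processing fractions quoted above as values for p. Under that assumption, the chance that at least one of N processors is delayed is 1 - (1 - p)^N.

```python
def prob_any_delayed(p, n_processors):
    """Probability that at least one of n_processors is delayed.

    Assumes each processor is delayed independently with probability p
    during the interval leading up to a barrier (a simplification).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    return 1.0 - (1.0 - p) ** n_processors


if __name__ == "__main__":
    # Illustrative values of p taken from the background loads in the text.
    for p in (0.015, 0.0625):
        for n in (1, 32, 128, 512, 2048):
            print(f"p={p:<7} N={n:<5} P(any delayed)={prob_any_delayed(p, n):.4f}")
```

In this model the probability climbs quickly with N and then flattens out close to 1, which is why the delay seen at each barrier stops growing once the processor count is large enough.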
By focusing on the interactions between the HPC application and the system background processes, HPC researchers identified a major culprit for performance losses in parallel applications. Additional research identified ways of preventing this thievery, but none to date has provided a successful, real-life implementation. Global process synchronization using the Linux scheduler eliminates wait time due to noise and promises significant performance gains. By looking beyond the application and into the role of the rest of the HPC system, we believe we have found a scalable, real-life implementation. With Linux running on each processing node and Linux process synchronization driven by global clock synchronization, the Cray implementation ensures that application processes run concurrently on all processors and that housekeeping is performed concurrently on all processors and bounded in time. Our process synchronization solution can prevent performance theft and increase application performance by up to 50% for fine-grained, highly parallel applications running on 32 processors or more.
Resources for this article: /article/7756.
Dr Paul Terry is the Chief Technology Officer for Cray Canada, Inc., previously OctigaBay Systems, which was acquired by Cray in April 2004. He is a technology strategist for innovative computing architectures and is responsible for establishing the company's technology vision and leadership.
Amar Shan, Director of Product Management, Cray, Inc., is responsible for introducing Cray's leading-edge technical innovations and creative business solutions into the marketplace. He has more than 20 years' experience in the computing and telecommunications industries in product management, development and architecture roles.
Pentti Huttunen, Benchmarking Specialist at Cray, Inc., is responsible for researching parallel computing technologies and optimizing applications to ensure that they are running efficiently on a variety of platforms at Cray, Inc.
Comments
Is the LSS source code available or in use in any standard Linux distributions?