Improving Application Performance on HPC Systems with Process Synchronization
So far, we have discussed only scheduling of batch jobs and system processes within a single node. However, to stop the performance thievery, this synchronized scheduler must work across all processors. Here, we encounter a critical system design criteria that makes this synchronized scheduler approach possible—the availability of global time synchronization. In our design, global time synchronization is carried out by communications processors designed within the HPC system. These processors offload communications processing from the application processors. Communications processors also run a time synchronization protocol to achieve global clock synchronization. Tight time synchronization can be achieved because the communications processors have control over packet scheduling and jitter—the difference in time between any pair of processors is less than 1 microsecond. A further advantage of delegating time synchronization to the communications processors is this load is removed from the processors carrying the application workload, leaving more time for application processing and further reducing interrupts to the application processors.
The time synchronization protocol includes additional fields for time slot alignment. The protocol uses a master-slave paradigm, where one node acts as the source of the time and time slot information and all other nodes in the system synchronize themselves to the master node's clock. The time synchronization packets received from the master identify the time slot being executed and the time elapsed since the start of the time slot, enabling precise alignment of scheduling frames across the entire HPC system.
This synchronized scheduler delivers synchronized execution of the processes in a parallel application. How much performance degradation can be avoided or how much potential performance can be gained is a function of how frequently the application uses barriers and/or collective operations, how much time is taken by system housekeeping processes and the number of processors employed by the application.
Our research indicates significant speedup can be achieved. Figures 3 and 4 show the theoretical speedup that can be achieved through the use of the synchronized scheduler, relative to the conventional priority scheduler. Figure 3 assumes that background processing requires 1.5% of the CPU, and Figure 4 assumes that 6.25% of the CPU is consumed by background processing—this is a realistic metric on most clusters. Curves are shown for applications encountering an average of 100, 200 and 300 barriers per second.
As the number of processors increases, the performance gain from the synchronized scheduler increases and asymptotically approaches a maximum value. This reflects the fact that performance doesn't continue to degrade with the conventional scheduler. After a certain processor count is reached, the probability of at least one processor being delayed by housekeeping increases to 100%. The addition of more processors does not significantly add to the application delay encountered at barriers.
By focusing on the interactions between the HPC application and the system background processes, HPC researchers identified a major culprit for performance losses in parallel applications. Additional research identified ways of preventing this thievery, but none to date have provided successful, real-life implementations. Global process synchronization using the Linux scheduler eliminates wait time due to noise and promises significant performance gains. By looking beyond the application and into the role of the rest of the HPC system, we believe we have found a scalable, real-life implementation. With Linux process synchronization using a global clock synchronization and Linux running on each processing node, the Cray implementation ensures application processes run concurrently on all processors and housekeeping is performed concurrently on all processors and bounded in time. Our process synchronization solution can prevent performance theft and increase application performance for fine-grained highly parallel applications running on 32 processors or more by up to 50%.
Resources for this article: /article/7756.
Dr Paul Terry is the Chief Technology Officer for Cray Canada, Inc., previously OctigaBay Systems, which was acquired by Cray in April 2004. He is a technology strategist for innovative computing architectures and is responsible for establishing the company's technology vision and leadership.
Amar Shan, Director of Product Management, Cray, Inc., is responsible for introducing Cray's leading-edge technical innovations and creative business solutions into the marketplace. He has more than 20 years' experience in the computing and telecommunications industries in product management, development and architecture roles.
Pentti Huttunen, Benchmarking Specialist at Cray, Inc., is responsible for researching parallel computing technologies and optimizing applications to ensure that they are running efficiently on a variety of platforms at Cray, Inc.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.
Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.
Sponsored by ActiveState
| Non-Linux FOSS: libnotify, OS X Style | Jun 18, 2013 |
| Containers—Not Virtual Machines—Are the Future Cloud | Jun 17, 2013 |
| Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer | Jun 12, 2013 |
| Weechat, Irssi's Little Brother | Jun 11, 2013 |
| One Tail Just Isn't Enough | Jun 07, 2013 |
| Introduction to MapReduce with Hadoop on Linux | Jun 05, 2013 |
- Containers—Not Virtual Machines—Are the Future Cloud
- Non-Linux FOSS: libnotify, OS X Style
- Validate an E-Mail Address with PHP, the Right Way
- RSS Feeds
- Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer
- Introduction to MapReduce with Hadoop on Linux
- Help with Designing or Debugging CORBA Applications
- New Products
- Returning Values from Bash Functions
- Linux Systems Administrator
- Welcome to 1998
4 min 29 sec ago - notifier shortcomings
28 min 11 sec ago - heroku?
2 hours 5 min ago - Android User
2 hours 6 min ago - Reply to comment | Linux Journal
3 hours 59 min ago - compiling
6 hours 49 min ago - This is a good post. This
12 hours 2 min ago - Great, This is really amazing
12 hours 4 min ago - These posts are really good
12 hours 5 min ago - It’s a really great site you
12 hours 8 min ago








Comments
Is the LSS source code
Is the LSS source code available or in use in any standard Linux distributions?