Linux Clusters at NIST
The first step in assessing the performance of the cluster was to determine the performance of the cluster nodes in terms of memory bandwidth. Memory and bus bandwidth performance can limit the effective use of the network bandwidth.
We used a NIST benchmark, memcopy, to determine the main memory bandwidth of the cluster nodes. For buffer sizes greater than the cache size but smaller than main memory size, the size of the buffer transferred did not affect the transfer rate. On the 200 MHz Pentium-Pro machines, we measured a peak transfer rate of 86MBps (672Mbps). On the 333 MHz Pentium II machines, the measured rate was 168MBps. Both of these rates far exceeded the line speeds of the ATM and Ethernet networks. Therefore, memory bandwidth is not a factor in utilizing the peak transfer rates of the network.
We measured the throughput and latency of the network using netperf (see Resources 4) and our own test kernel called pingpong. Using pingpong along with the MultiKron allowed a direct and precise measurement of the latency between cluster nodes. netperf was used to measure TCP and UDP performance, while variations of the pingpong program were used to measure the performance of the LAM, TCP, UDP and ATM socket levels.
Using the netperf stream benchmarks to measure throughput, we measured a peak rate of 133.88Mbps (86% of the OC-3 line rate) for TCP/IP over ATM. For TCP/IP over Ethernet, we measured 94.85Mbps (95% of the line rate). Both of these rates are near the maximum payload rate for the respective networks.
Measuring throughput with the pingpong program provided more insight into the network performance. While the netperf results tended to produce smooth curves, much more variability in the throughput was seen with pingpong as the message size increased. For messages below 16KB, Fast-Ethernet performed better than ATM when using TCP/IP. At this message size, Fast-Ethernet is near its maximum throughput, while ATM is not. With messages larger than 16KB, ATM throughput increases to surpass Fast-Ethernet.
While running the throughput tests, we noticed that the TCP/IP throughput drops dramatically when the message size is near 31KB. By using the MultiKron toolkit to probe the network stack in the Linux kernel, we were able to find the cause of the throughput drop. With the Linux 2.0.x kernels, transmission of the last message segment is delayed, even though the receiver window has opened to include room for the segment. We modified the kernel TCP software to prevent this delay, resulting in the elimination of the performance dip. (For details, see http://www.multikron.nist.gov/scalable/publications.html.)
To measure the latency of message transmission, we sampled the synchronized MultiKron clocks on the two cluster nodes involved in the data transfer. The latency is the time required for a minimum length message to be sent from one node to another. Table 1 gives the results of the measurements for different layers of the network stack. The values given are the one-way times from sender to receiver. Therefore, the TCP/IP measurement includes the device driver and switch time as well. Likewise, the device driver measurement includes the switch time.
The latency added by the ATM switch is greater than that of Fast-Ethernet for small messages. However, as message size grows, so does the latency added by the Fast-Ethernet switch, while the ATM switch latency stays constant. Table 2 shows the application layer latency when sending various-sized packets, using both the Fast-Ethernet switch and a crossover wire. As can be seen in the table, the latency added by the switch is between 123 and 131 microseconds. These latency values were consistent for several switches from different manufacturers. The cause is the buffering of each frame until it is completely received, rather than buffering only the header bytes, then overlapping the send and receive after the destination is determined from the frame header. (We have confirmed this with one switch manufacturer.) Although the latency is constant for each packet, it is easily hidden by pipelining for all but the first packet in a burst.
We have run several NIST applications on the cluster. Most of these applications are computation-bound, with little disk access. One exception is a speech processing job described below.
Our speech processing application is a batch job submitted piecemeal to each cluster node from a central server. This job was used to process over 100 hours of recorded speech. The processing involves analyzing the speech to produce a text translation. The job ran for nearly three weeks with little interruption on the cluster. The total CPU time used for the processing was over 42 million seconds across the 32 cluster nodes. Each piece of the job transfers 50MB of data from the central server via the Network File System (NFS) before starting the computation. Linux NFS has proven to be very stable. Overall, 6464 sub-jobs were run as part of the speech processing application, with 98.85% completing successfully.
Another NIST application run on the cluster, OA, predicts the optical absorption spectra of a variety of solids by considering the interaction of excitons. The bulk of the computation is based on a fast Fourier transform (FFT)/convolution method to calculate quantum mechanical integrals. The OA application was run on the cluster as well as Silicon Graphics (SGI) Origin and Onyx systems. Figure 2 shows the execution time of the OA application on the SGI and cluster systems. The best runtime occurred on the 8-node Origin, at 500 seconds, while the runtime on the 8-node ATM sub-cluster was 900 seconds. For the 16-node ATM sub-cluster, the runtime was only slightly better, showing that the application does not scale well beyond 8 nodes. The results show nearly a factor of two performance difference between the cluster and the Origin for this application, while the cost differential is more than a factor of ten. Running the job using the ATM network decreased the runtime by 30% compared to the Fast-Ethernet network, where the runtime was 1300 seconds. This difference is due to the higher throughput obtainable over ATM.
The third NIST application, 3DMD, implements a three-dimensional matrix decomposition algorithm to solve elliptic partial differential equations. This application is considered “course-grained” because it generates large (100KB or more) messages at infrequent intervals. This application scales well as more nodes are added. Figure 3 shows the execution time of 3DMD on the SGI parallel computers and the cluster. With 16 cluster nodes, 3DMD ran faster than with the 8 Origin nodes (the maximum available on the Origin). For this application, there is a 10% performance difference between ATM and Fast-Ethernet, with ATM performing better.
Figure 4 shows the execution time of the Numerical Aerodynamics Simulation (NAS) parallel benchmarks (see Resources 8). The NAS benchmarks are packaged as a suite of programs designed to measure the performance of parallel computers on computationally intensive aerophysics applications. The NAS suite is written in FORTRAN 77 using the MPI communication standard (distributed memory). The figure shows execution times for the NAS example problems when run on the 8-node SGI Origin, 8-node SGI Onyx and 8- and 16-node cluster using both Fast-Ethernet and ATM. The number in parentheses following the machine name gives the number of processors used for the problem run. The cluster competes well with the traditional parallel machines and ATM has an advantage over Fast-Ethernet for several of the benchmarks.
The second set of benchmarks we ran were the Stanford Parallel Applications for Shared Memory (SPLASH) (see Resources 9). This benchmark suite differs from NAS in that SPLASH utilizes shared memory as opposed to distributed memory. In order to run the SPLASH suite on the cluster, we used the TreadMarks (see Resources 10) Distributed Shared Memory (DSM) system. TreadMarks emulates DSM via the Network File System. Figure 5 shows the execution times for two of the SPLASH programs, Raytrace and LU. Raytrace consists of mostly computation, with very little communication, while LU spends a large percentage (nearly 35%) of its time communicating. The graph shows that for Raytrace, the cluster performs very well; however, for LU, cluster performance does not compare favorably with the parallel machines. This performance gap is due to the high communication overhead for small messages incurred on the cluster for the LU application.
The purpose of running the NAS, SPLASH and other benchmarks is to get a feel for the types of applications a cluster can run effectively. Also, for applications similar to the SPLASH LU, where communication time is the major factor in runtime, we need to delve deeper into the Linux network software and determine how network performance can be improved for these types of applications.
One other application for the cluster is the Distributed.NET project (see http://www.distributed.net/). During periods of low activity, the RC5 encryption software is run on the cluster nodes. Each node runs the software independently of the others, so we can participate with any number of nodes. We have run all 48 nodes (plus the front end) at times, with an aggregate key processing rate of over 32 million keys per second.
Free DevOps eBooks, Videos, and more!
Regardless of where you are in your DevOps process, Linux Journal can help!
We offer here the DEFINITIVE DevOps for Dummies, a mobile Application Development Primer, and advice & help from the expert sources like:
- Linux Journal
- New Products
- New Products
- Integrating Trac, Jenkins and Cobbler—Customizing Linux Operating Systems for Organizational Needs
- Tech Tip: Really Simple HTTP Server with Python
- Dialog: An Introductory Tutorial
- RSS Feeds
- Non-Linux FOSS: Remember Burning ISOs?
- Returning Values from Bash Functions
- EdgeRouter Lite