Parallel Computing Using Linux

Various classes of problems lend themselves to parallel computing solutions. This article discusses the concepts and shows how Linux can be used to address the problem.
Shared Memory Constructs

Another approach to parallel programming is the use of shared memory parallel language constructs. The idea behind this scheme is to identify the parallel and sequential regions of a program (Figure 2). The sequential region of a program is run on a single processor while the parallel region is executed on multiple processors. A parallel region consists of multiple threads which can run concurrently.

Figure 2. Parallel and Sequential Regions of a Program

For some programs, identifying parallel and sequential regions may be a trivial task, but for other programs it may involve modifying the sequential program to create parallel regions. The easiest approach is to rely on the compiler to determine parallel regions. This is the automatic parallelization approach which usually gives poor results. Most compilers are conservative when introducing parallelism and will not parallelize code if there is any ambiguity. For example, if elements of an array x are accessed through an index array, e.g., x(index(i)), in a loop, the loop will not be parallelized since the compiler cannot know if the elements of array x will be accessed independently in the loop.

The time-consuming part of any program is usually spent in executing loops. Parallel regions are created for a loop to speed up the execution of a loop. For the compiler to build a parallel region from a loop, the private and shared data variables of the loop must be identified. The example below is for the Silicon Graphics Challenge machine, but the concepts are similar for other shared-memory machines such as the Digital Alpha, IBM PC or Cray J90. Shared-memory constructs are placed before and after the loop.

#pragma parallel
#pragma shared (a,b,c)
#pragma local (i)
#pragma pfor iterate (i=0;100;1)
 for (i = 0; i < 100; i++) {
  a(i) = b(i) * c(i)
#pragma one processor

The code before the first pragma statement and after the last pragma statement is executed on a single processor. The code in the loop is executed in parallel using multiple threads. The number of threads used is based on an environment variable. Identifying shared and private variables is easy for simple loops, but for a complex loop with function calls and dependencies it can be a tedious job.

Programming using shared-memory constructs can be simpler than message passing. A parallel version of a program using shared-memory constructs can be developed more rapidly than a message-passing version. While the gain in productivity may be appealing, the increase in performance depends on the sophistication of the compiler. A shared-memory compiler for parallel programs must generate code for thread creation, thread synchronization and access to shared data. In comparison, the compiler for message-passing is simpler. It consists of the base compiler with a communication library. While no one can predict better performance using shared-memory constructs or message-passing, the message-passing model offers more scope for optimization. Data collection and distribution algorithms can be optimized for a particular application. Domain decomposition can be performed based on communication patterns in the code.

Another advantage of message passing is portability. A message-passing program written using a standard communication protocol can be ported from a network of PCs to a massively parallel supercomputer without major changes. Message passing can also be used on a network of heterogenous machines.

Each vendor of a shared memory compiler has chosen a different syntax for compiler directives. Therefore, code parallelized using directives is usually restricted to a particular compiler.

What is PVM?

Parallel Virtual Machine (PVM) is currently the most popular communication protocol for implementing distributed and parallel applications. It was initially released in 1991 by the Oak Ridge National Laboratory and the University of Tennessee and is being used in computational applications around the world. PVM is an on-going project and is freely distributed. The PVM software provides a framework to develop parallel programs on a network of heterogenous machines. PVM handles message routing, data conversion and task scheduling. An application is written as a collection of co-operating tasks. Each task accesses PVM resources through a collection of library routines. These routines are used to initiate tasks, terminate tasks, send and receive messages and synchronize between tasks. The library of PVM interface routines is one part of the PVM system. The second part of the system is the PVM daemon, which runs on every machine participating in the network for an application and which handles communication between tasks.

PVM can use streams or datagram protocols (TCP or UDP) to transmit messages. Normally, TCP is used for communication within a single machine, and UDP is used for communication between daemons on separate machines. UDP is a connectionless protocol which does not perform error handling. Therefore, when UDP is used for communication, PVM uses a stop-and-wait approach for every packet. Packets are sent one at a time with an acknowledgement for each packet. This scheme gives poor bandwidth on a system with a high latency. An alternative approach is to use TCP directly between tasks bypassing the daemon (Figure 3).

Figure 3. Inter-host communication using UDP

TCP is a reliable transport protocol and does not require error checking after transmitting a packet. The overhead for using TCP is higher than UDP. A separate socket address is required for every connection and additional system calls are used. So, while TCP gives better performance, it is not scalable.

Figure 4. Message Flow within a Host

A number of steps must be completed to send a message from host 1 to host 2 (Figure 4). In the first step, a message buffer is initialized on host 1. Second, the message data is packed using the PVM pack routines. Finally, the message is labelled with a message tag and sent to host 2. To receive a message, host 2 issues a PVM call with a message tag and a host ID. Optionally, a wildcard message type or host ID can be used. Host 2 must then unpack the message in the order it was packed at host 1.

Experimental PVM implementations using threads and ATM (asynchronous transfer mode) have been developed to obtain a higher bandwidth and lower latency. While the use of PVM is widespread, the Message Passing Interface (MPI) standard is gaining popularity.