First Beowulf Cluster in Space

by Ian McLoughlin

on July 28, 2005

When President Eisenhower proposed the Open Skies Policy to the Soviet delegation at the 1955 summit meeting in Geneva, it was an unsuccessful move to legitimise US plans to launch the U-2 spy plane a month later. Five decades later, Open Skies becomes a reality with the launch of Singapore's X-Sat. What could complement Open Skies better than open source? And it doesn't take a genius to understand that, when reliability is all-important, the transparent and open nature of Linux source is an invaluable aid.

At the outset of the X-Sat Project, which focused on developing Singapore's first satellite, arguments were made for Linux but were roundly rejected. At that time, Linux was an esoteric outsider for embedded systems use and hadn't penetrated the consciousness of decision-makers in the area of space developments. Furthermore, Singapore generally is not known for risk taking, and truly there is something to be said in favour of this attitude where satellites are concerned. By contrast, VxWorks has excellent space heritage, although this is no guarantee of success.

Although the satellite's main computer runs VxWorks, Linux powers the data processing computer. This actually is a loosely coupled cluster called the parallel processing unit (PPU), and it is the first example of distributed Linux in space. The concept is to run satellite image-processing applications directly in space after a straightforward re-compilation and uploading procedure from ground-based Linux development platforms. Let's compare the main on-board computer (OBC) and the PPU (Table 1).

Table 1. X-Sat has both an on-board computer (OBC) and a parallel processing unit (PPU).

	OBC	PPU
Processors	2 x ERC32	20 x SA 1110
Configuration	Cold-redundant standby	Whatever you want
Peak performance [MIPS]	20	4,000
Total memory [MB]	8	1,280
Size [cm³]	3,125	3,125
Power consumption [Watt]	Approx. 2	25
Hardware cost [US$]	50,000	3,500
Processing cost [US$ / MIPS]	2,500	0.88
Processing volume [cm³ / MIPS]	156.25	0.78
Processing power [mW / MIPS]	100	6.25
Operating system	VxWorks	Linux
Costs for OS	A few thousand dollars	Free
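
The derived rows of Table 1 follow directly from the raw figures. As a quick check, the minimal C sketch below recomputes them. The struct and function names are illustrative only and do not come from the X-Sat software; the OBC power figure is the approximate value given in the table.

```c
#include <stdio.h>

/* Raw figures taken from Table 1; all names here are illustrative. */
struct computer {
    const char *name;
    double mips;        /* peak performance [MIPS] */
    double cost_usd;    /* hardware cost [US$] */
    double volume_cm3;  /* size [cm^3] */
    double power_w;     /* power consumption [W] (approximate for the OBC) */
};

const struct computer OBC = { "OBC",   20.0, 50000.0, 3125.0,  2.0 };
const struct computer PPU = { "PPU", 4000.0,  3500.0, 3125.0, 25.0 };

/* Each derived row is simply a raw figure divided by peak MIPS. */
double cost_per_mips(const struct computer *c)
{
    return c->cost_usd / c->mips;
}

double volume_per_mips(const struct computer *c)
{
    return c->volume_cm3 / c->mips;
}

double mw_per_mips(const struct computer *c)
{
    return c->power_w * 1000.0 / c->mips;   /* W -> mW */
}

void print_ratios(const struct computer *c)
{
    printf("%s: %.2f US$/MIPS, %.2f cm^3/MIPS, %.2f mW/MIPS\n",
           c->name, cost_per_mips(c), volume_per_mips(c), mw_per_mips(c));
}
```

Calling print_ratios(&OBC) and print_ratios(&PPU) prints the three per-MIPS ratios side by side.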

The OBC is so expensive because it utilises costly radiation-hardened components, whereas the PPU uses mostly commercial-off-the-shelf (COTS) components. Traditionally, reliability in space is ensured by using the most reliable individual components to survive the hostile environment. But the PPU embodies the concept, relatively new at least in space, of reliability through redundancy. Although each single component of the PPU is less reliable in space than the OBC components are, the PPU has 20 processors, so even if they fail one after another, something still remains. The design almost eliminates single-point failures, where a single component failure could take out the entire system or multiple components. On top of this, the PPU is characterised by graceful degradation from a fully working 20-processor system down to a single processor. So, good design ensures that the probability of at least one PPU processor still functioning at the end of design life matches the probability that the OBC still is functioning at the same time. And even with only a single surviving processor, it still thrashes the OBC.
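
To put a number on the redundancy argument: if each processor independently survives to the end of the mission with probability p, the chance that at least one of n processors survives is 1 - (1 - p)^n. The C sketch below computes this. It assumes independent failures, and the function names and any probabilities fed into it are hypothetical; none of this comes from the X-Sat reliability analysis.

```c
/* Probability that at least one of n nodes survives, assuming each
 * node survives independently with probability p (an assumption made
 * for this sketch, not a measured X-Sat figure). */
double p_any_survives(double p, int n)
{
    double p_all_fail = 1.0;
    for (int i = 0; i < n; i++)
        p_all_fail *= (1.0 - p);   /* every node must fail */
    return 1.0 - p_all_fail;
}

/* Expected number of surviving nodes: a rough measure of how far the
 * cluster has degraded by the end of the mission. */
double expected_survivors(double p, int n)
{
    return p * n;
}
```

Even with a pessimistic, made-up per-node survival probability of 0.2, p_any_survives(0.2, 20) is better than 0.98, which is the sense in which many unreliable parts can rival one reliable part.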

X-Sat

X-Sat is a 100kg micro-satellite, roughly an 80cm cuboid, as shown in the CAD model of Figure 1. The satellite, an educational project at Nanyang Technological University, Singapore, carries three payloads: a 10m resolution multispectral (colour) camera for obtaining images in the Singaporean region, a radio link for an Australian-distributed sensor network and, most notably, the PPU. From the outset of the project, X-Sat has been an open satellite with details publicly available, and what better reason to use open-source software?

Figure 1. X-Sat is about 80cm in size and carries a colour camera, radio link and a Linux cluster.

Communication uses S-band with 4kb/s uplink and 500kb/s downlink as well as a unidirectional 50Mb/s downlink over X-band for image dumps. However, the X-band needs a dedicated 13m dish antenna for reception, and it works only when the satellite is over Singapore. In its intended sun-synchronous 685km orbit, this occurs for only a few minutes every day and leads to a major rationale for the PPU. Assuming a conservative duty cycle of 10% per orbit, the camera can generate 81GB of data per day, but only 12% of this can be downlinked. And with a three-year design lifetime and multimillion-dollar cost, each picture works out to be rather expensive.

If we want maximum value per picture and we have to throw away 88% of the images, which ones do we select? Anyone who has been to Singapore should remember the overcast skies. It turns out that 90% of satellite pictures over Singapore show only clouds and haze. Although this may excite meteorologists, it's a waste of money for us. We'd get value from only 10% of the 12% of pictures we download—a 1.2% success rate! So if there's any way of deciding before downloading whether a picture is obscured by clouds or even a way of cutting out the cloudy bits and downloading the rest, then this is valuable. Well, you guessed it, such applications do exist. They run on Linux Beowulf clusters, require many MIPS and happen to be a perfect fit for our PPU.
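
The arithmetic behind this argument is easy to reproduce. The C sketch below uses only the figures quoted above (81GB per day, 12% downlink capacity, 10% cloud-free images). The function names are made up, and the "screened" case assumes an ideal on-board classifier that never misjudges an image, which no real algorithm achieves.

```c
/* Figures quoted in the text. */
#define DAILY_CAPTURE_GB    81.0   /* camera output per day */
#define DOWNLINK_FRACTION   0.12   /* share that fits through the X-band link */
#define CLEAR_SKY_FRACTION  0.10   /* share of images not obscured by cloud */

/* Useful data per day when images are downlinked without screening:
 * only the cloud-free share of whatever was sent is worth anything. */
double useful_gb_blind(void)
{
    return DAILY_CAPTURE_GB * DOWNLINK_FRACTION * CLEAR_SKY_FRACTION;
}

/* Useful data per day when only cloud-free images are sent (ideal
 * classifier assumed). The result is limited by whichever is smaller:
 * the link capacity or the amount of clear imagery. */
double useful_gb_screened(void)
{
    double link  = DAILY_CAPTURE_GB * DOWNLINK_FRACTION;
    double clear = DAILY_CAPTURE_GB * CLEAR_SKY_FRACTION;
    return link < clear ? link : clear;
}
```

Under these assumptions, blind downlinking yields just under 1GB of useful imagery per day, whereas ideal screening yields about 8.1GB, because all of the clear imagery then fits within the link budget.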

PPU Design

The PPU is built around two anti-fuse Actel field programmable gate arrays (FPGAs), known to be more radiation-resistant than alternatives from Xilinx or Altera. Each FPGA hosts ten processing nodes (PNs), each with a 206MHz StrongARM processor and 64MB of SDRAM. Each FPGA also is connected to three Atmel 4MB serial Flash chips containing a bootloader, the OS kernel and filesystem images, which include selected image-processing applications. Of course, programs can be added dynamically while the satellite is in space, as though it were a regular Linux cluster.

The PPU is connected to the rest of the satellite by fairly slow quad-redundant controller area network (CAN) links and two fast (200Mb/s) low-voltage differential signalling (LVDS) links for image data from the on-board camera. Figure 2 shows an overview of the hardware architecture. Most interestingly, the PPU also can take over satellite control from the OBC. In fact, this is one of the experiments intended to validate that software and hardware COTS components can fulfil mission-crucial tasks.

Figure 2. The cluster is based on two FPGAs, each connected to ten 206MHz StrongARM processors.

Internally, the PPU resembles a cluster-based computing system with the FPGAs providing the interconnection network. In fact, these hubs themselves can offer image-processing capabilities. The cluster concept means we can sacrifice PNs to failure and yet carry on system operation regardless. It also gives each PN sufficient autonomy to run multiple algorithms simultaneously. As each FPGA has its own independent communication links, PPU operation can continue even with severe failures, such as destruction of an entire FPGA.

A parallel bus interfaces each PN to an FPGA. Given that ten PNs communicate with one FPGA, hardware I/O pins on the FPGA become a limitation. It is impossible to support ten full 32-bit buses. A 16-bit data bus is the next logical choice but results in a halving of the effective bus bandwidth. However, considerable effort was made to ensure that this slimmed interface operates efficiently, and it has resulted in a novel 17-bit data bus, which is discussed later. From the PN perspective, the FPGA is memory mapped into address space using an addressable window concept to reduce parallel bus requirements.
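
The article does not spell out how the addressable window works, so the following C model is only one plausible reading: a select register chooses which block of FPGA resources is visible, and the PN then addresses words within that block. The window size, window count and all names are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical layout: the FPGA exposes WINDOW_WORDS words at a time,
 * and a select register chooses which window is currently visible. */
#define WINDOW_WORDS  256u
#define NUM_WINDOWS   16u

struct fpga_model {
    uint32_t window_select;                       /* currently visible window */
    uint32_t storage[NUM_WINDOWS][WINDOW_WORDS];  /* stand-in for FPGA resources */
};

/* Read from a flat FPGA address: select the window, then index into it.
 * The bus needs only enough address lines to span one window. */
uint32_t fpga_read(struct fpga_model *f, uint32_t flat_addr)
{
    f->window_select = (flat_addr / WINDOW_WORDS) % NUM_WINDOWS;
    return f->storage[f->window_select][flat_addr % WINDOW_WORDS];
}

/* Write to a flat FPGA address using the same two-step scheme. */
void fpga_write(struct fpga_model *f, uint32_t flat_addr, uint32_t value)
{
    f->window_select = (flat_addr / WINDOW_WORDS) % NUM_WINDOWS;
    f->storage[f->window_select][flat_addr % WINDOW_WORDS] = value;
}
```

The trade-off in such a scheme is an extra register write whenever an access crosses into a different window, in exchange for far fewer address pins per PN.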

Booting

Booting of the PNs is sequential to reduce peak power on startup and consists of three stages. In the first stage, the StrongARM operates in 16-bit access mode and executes the bootloader, a tiny routine coded in ARM assembler, directly out of the FPGA's lowest address window. Although 16-bit access translates into half-bandwidth memory access, the small size of the bootloader (less than 512 bytes) makes this acceptable. The bootloader initialises the StrongARM, sets up SDRAM and then loads the second stage from serial Flash. The second stage retrieves the kernel and ramdisk from serial Flash, executes the kernel decompressor and boots Linux. Finally, the third stage consists of the bzImage, which decompresses itself into the appropriate memory location and then executes the kernel, which in turn decompresses its ext2 initrd ramdisk.

The 17-Bit Bus Interface and Protocol

All communication to the PN occurs through the FPGA. A kernel device driver plus a user-space library provide a standard interface API for Linux applications. The low-level driver maintains two filesystem character devices that implement interrupt handling and software receive/transmit buffers. In order to keep the driver efficient and simple, kernel preemption was disabled. The driver also periodically writes to a watchdog register in the FPGA as a heartbeat signal; if the heartbeat stops, the timeout causes a reboot.
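
The heartbeat mechanism can be modelled in a few lines of C. This is a toy simulation rather than driver code: the real watchdog register interface and timeout are not described in the article, so the structure, names and tick-based timing here are all assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a heartbeat watchdog; every name here is hypothetical. */
struct watchdog {
    uint32_t timeout_ticks;  /* ticks allowed between heartbeats */
    uint32_t remaining;      /* ticks left before the node is reset */
    bool     reset_fired;    /* set once the timeout has expired */
};

void watchdog_init(struct watchdog *w, uint32_t timeout_ticks)
{
    w->timeout_ticks = timeout_ticks;
    w->remaining = timeout_ticks;
    w->reset_fired = false;
}

/* Driver side: called periodically while the kernel is healthy. */
void watchdog_kick(struct watchdog *w)
{
    w->remaining = w->timeout_ticks;
}

/* FPGA side: called once per tick; returns true when the node should
 * be rebooted because no heartbeat arrived in time. */
bool watchdog_tick(struct watchdog *w)
{
    if (w->remaining > 0)
        w->remaining--;
    if (w->remaining == 0)
        w->reset_fired = true;
    return w->reset_fired;
}
```

As long as watchdog_kick() is called more often than the timeout, watchdog_tick() keeps returning false; a hung kernel stops kicking and the reset fires.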

In the PPU, writes from the PN to FPGA fall into two classes: control and message data. Message data normally is destined for another PN, whereas control data directs some action on the part of either the FPGA or PN. Similarly, reads of the FPGA by a PN also fall into these categories.

In the case of message data writes from one PN to another via the FPGA, each item of data destined for a particular PN must be addressed. Either addressing information is part of each and every word transferred or it's set in advance. In the PPU, message paths are set in advance under PN control for efficiency reasons, assuming most transfers are large, which is without doubt the case for satellite images. But a 16-bit interface conveying 16-bit data messages must have a mechanism to distinguish between data and address packets. This could be achieved by writing these to separate address registers in the FPGA.
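
The idea of setting the path once and then streaming unaddressed data words can be illustrated with a small C model of one FPGA hub. The register layout, queue depth and function names are invented for this sketch; only the figure of ten PNs per FPGA comes from the article.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_PNS   10   /* PNs per FPGA, as in the article */
#define QUEUE_LEN 64   /* arbitrary inbox depth for this sketch */

/* Toy model of one hub: each sender has a destination register that
 * is set once per transfer, and each PN has an inbox. */
struct hub {
    uint8_t  dest[NUM_PNS];
    uint16_t inbox[NUM_PNS][QUEUE_LEN];
    size_t   count[NUM_PNS];
};

/* Control write: choose the destination for subsequent data words. */
int hub_set_dest(struct hub *h, unsigned src, unsigned dst)
{
    if (src >= NUM_PNS || dst >= NUM_PNS)
        return -1;
    h->dest[src] = (uint8_t)dst;
    return 0;
}

/* Data write: the word itself carries no address; the hub routes it
 * using the destination register set earlier. */
int hub_send(struct hub *h, unsigned src, uint16_t word)
{
    if (src >= NUM_PNS)
        return -1;
    unsigned dst = h->dest[src];
    if (h->count[dst] >= QUEUE_LEN)
        return -1;   /* inbox full */
    h->inbox[dst][h->count[dst]++] = word;
    return 0;
}
```

For a large image transfer, the single hub_set_dest() call is amortised over thousands of hub_send() calls, which is why per-word addressing would be wasteful.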

The situation for reading the FPGA is trickier, however. A 16-bit bus requires two reads for each message: one read to determine message type and/or length and another to convey the actual message. But because our messages have variable length, there is an immediate problem concerning the timing of such messages. The reason is that an interrupt signal is used to indicate a 16-bit value waiting to be read, and the PNs are under obligation to respond. So for long messages, the PN would read a sequence of 16-bit half-words. But it has no obvious means of distinguishing a 16-bit control word inserted into this sequence. We could prefix all half-words with a type header, but that would mean two reads per half-word of message, halving the bandwidth.

Our solution is a 17-bit bus with the StrongARMs operating in 32-bit access mode. Both raw data and commands share the physical link as half-words with their type differentiated by the state of a special 17th bit that indicates to the PN whether an incoming item is data or a control message. Most important, it does this without requiring any extra read cycles or extra bus bandwidth.
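
Seen from C, a word on such a bus might be packed and unpacked as follows. The sketch assumes that the 16-bit payload sits in D0 to D15 and that the type flag appears on D31 of the 32-bit read, with a set flag meaning a control word; the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of one 32-bit read from the 17-bit bus. */
#define TYPE_BIT      UINT32_C(0x80000000)  /* D31: set for control words */
#define PAYLOAD_MASK  UINT32_C(0x0000FFFF)  /* D0..D15: 16-bit payload */

/* Build the 32-bit value the PN would see for a given half-word. */
uint32_t bus_pack(uint16_t payload, bool is_control)
{
    return (is_control ? TYPE_BIT : UINT32_C(0)) | payload;
}

/* The type is available from the same read as the payload, so no
 * second bus cycle is needed to classify the word. */
bool bus_is_control(uint32_t word)
{
    return (word & TYPE_BIT) != 0;
}

uint16_t bus_payload(uint32_t word)
{
    return (uint16_t)(word & PAYLOAD_MASK);
}
```

For example, bus_pack(0xBEEF, true) gives 0x8000BEEF, and both bus_is_control() and bus_payload() recover their answers from that single value.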

This approach wouldn't be of interest if the driver module couldn't take direct advantage of the load-store nature of the ARM and the fact that all instructions are conditional. The former implies that the 32-bit read from the 17-bit bus is loaded into an internal register before being moved to memory. The latter implies that if the 17th bit of the interface is wired to the most significant data bit, D31, rather than the more obvious choice of D16, it can be used to affect the negative (N) flag. As a result, the data destination, one of two internal memory buffers, can be controlled through conditional data moves. This is extremely efficient compared with the conditional branch that most other processors would need. The following assembler code provides an example with r0, r1, r2 and r3 being the registers for the address of the FPGA data transfer, the control word buffer, the message word buffer and the type, respectively. In summary, the code for the optimised solution is 33% faster and uses one register less:

SCENARIO I—Default Read Method:

...
LDR   r4, [r0]        ; Load FPGA value
LDR   r5, [r3]        ; Load type register
TST   r5, #0x80000000 ; Check for D31
STREQ r4, [r1]        ; Z flag set (control)
STRNE r4, [r2]        ; Z flag not set (message)
B     _repeat         ; Loop again

SCENARIO II—Optimised Read Method:

...
LDR   r4, [r0]        ; Load FPGA value
STRMI r4, [r1]        ; N flag set (control)
STRPL r4, [r2]        ; N flag not set (message)
B     _repeat         ; Loop again
...
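
For readers who do not read ARM assembler, the optimised method can be rendered in C roughly as follows. This is an illustrative rewrite, not the driver's source: the buffer structure, its depth and the function names are made up, and a C compiler is free to emit a branch where the hand-written assembler uses conditional stores.

```c
#include <stddef.h>
#include <stdint.h>

#define BUF_WORDS 128  /* arbitrary depth for this sketch */

struct rx_buffers {
    uint32_t control[BUF_WORDS];
    uint32_t message[BUF_WORDS];
    size_t   n_control;
    size_t   n_message;
};

/* Sort one word read from the FPGA into the matching buffer.
 * Bit 31 is the sign bit that the MI/PL condition codes examine:
 * set means control, clear means message. Words are dropped when
 * the target buffer is full. */
void rx_dispatch(struct rx_buffers *rx, uint32_t word)
{
    if ((word >> 31) != 0) {
        if (rx->n_control < BUF_WORDS)
            rx->control[rx->n_control++] = word;
    } else {
        if (rx->n_message < BUF_WORDS)
            rx->message[rx->n_message++] = word;
    }
}

/* Drain a burst of words, as an interrupt handler might. */
void rx_dispatch_burst(struct rx_buffers *rx, const uint32_t *words, size_t n)
{
    for (size_t i = 0; i < n; i++)
        rx_dispatch(rx, words[i]);
}
```

The point carried over from the assembler is that the type decision uses the value already loaded; no second read of a type register is required.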

Software Error Detection and Correction

All satellites are subject to cosmic ray irradiation. Besides aging effects, the most frequent consequence is random bit-flip errors in SDRAM and the CPU. Left unchecked, these ultimately lead to large-scale data corruption. From a software perspective, the result of every calculation as well as every word in memory is suspect. It goes without saying that a mechanism to detect and correct such errors must be implemented in any space-based system.

Typical solutions for error detection and correction (EDAC) protection involve custom hardware checksum generators. But for our 20-processor PPU, a hardware checksum solution is overly complex, so we utilise a less efficient but simpler multilayer software approach. An EDAC process is scheduled periodically in kernel space to provide error protection. A second EDAC process allows the two to be cross-checked for redundancy.

Process integrity verification in our system is performed for crucial code between scheduled runs of the EDAC processes. In addition, input and output values of protected software procedures are monitored. If unexpected values are detected, the system employs a clean-up approach, retries the calculation, outputs a previously calculated value or uses the most significant bit-flip correction scheme. Which of these to use is configured on a per-function basis in a parametric verification table, which itself is EDAC-protected.
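
A per-function verification table of this kind might look like the C sketch below. The four recovery actions are the ones named above, but the table layout, the use of a simple output range as the plausibility test, the example entries and all identifiers are assumptions made for illustration. The EDAC protection of the table itself is not modelled.

```c
#include <stddef.h>
#include <string.h>

/* Recovery actions named in the text; the enum itself is hypothetical. */
enum edac_policy {
    EDAC_CLEAN_UP,
    EDAC_RETRY,
    EDAC_USE_PREVIOUS,
    EDAC_BITFLIP_CORRECT
};

struct verify_entry {
    const char      *func;     /* protected function name */
    int              min_out;  /* smallest plausible output */
    int              max_out;  /* largest plausible output */
    enum edac_policy policy;   /* action when the output is implausible */
};

/* Example contents are invented for illustration. */
static const struct verify_entry verify_table[] = {
    { "calc",       0, 1000, EDAC_RETRY },
    { "cloud_mask", 0,  255, EDAC_USE_PREVIOUS },
};

/* Find the table entry for a function, or NULL if it is unprotected. */
const struct verify_entry *verify_lookup(const char *func)
{
    for (size_t i = 0; i < sizeof verify_table / sizeof verify_table[0]; i++)
        if (strcmp(verify_table[i].func, func) == 0)
            return &verify_table[i];
    return NULL;
}

/* Returns 1 if the value is plausible for this function, 0 otherwise.
 * Functions without a table entry always pass. */
int verify_output(const char *func, int value)
{
    const struct verify_entry *e = verify_lookup(func);
    if (e == NULL)
        return 1;
    return value >= e->min_out && value <= e->max_out;
}
```

When verify_output() returns 0, the caller would look up the entry's policy and act on it; that dispatch is omitted here because the article does not describe how each action is carried out.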

C code is protected through a single header file and linkable library code. The function entry definition is inserted manually:


#define EDAC_CHECK \
    entry_check_edac(__func__);

GCC resolves __func__ at compile time to the string name of the function being entered. The on-demand EDAC process is invoked before the function body executes. A return re-definition is similar:


#define return(z) \
    return_check_edac(__func__, \
        __builtin_return_address(0), z); \
    return(z);

The developer inserts this into the code, as in the sample program given below:

int calc(int x, int y) {
    EDAC_CHECK
    .....
    return(z);
}
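
To see the hooks in action without the real EDAC library, which the article does not list, the following self-contained C sketch substitutes stub hooks that merely count calls. It also uses a separately named EDAC_RETURN macro instead of redefining the return keyword; that is a deliberate deviation from the article's approach, chosen so the macro expands to a single statement. It relies on the GCC builtin __builtin_return_address, which Clang also provides.

```c
/* Stub hooks standing in for the real EDAC library. They only count
 * calls so that the control flow can be observed. */
static int entry_checks;
static int return_checks;

void entry_check_edac(const char *func)
{
    (void)func;
    entry_checks++;
}

void return_check_edac(const char *func, void *ret_addr, int value)
{
    (void)func;
    (void)ret_addr;
    (void)value;
    return_checks++;
}

#define EDAC_CHECK entry_check_edac(__func__);

/* Hypothetical variant of the return hook: do { } while (0) makes the
 * expansion a single statement, so it is safe after an unbraced if. */
#define EDAC_RETURN(z) \
    do { \
        return_check_edac(__func__, __builtin_return_address(0), (z)); \
        return (z); \
    } while (0)

int calc(int x, int y)
{
    EDAC_CHECK
    int z = x + y;   /* placeholder for the protected computation */
    EDAC_RETURN(z);
}
```

Each call to calc() triggers exactly one entry check and one return check, and the caller still receives the computed value.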

Using this, a malfunctioning program can't cause too much damage. And even if the kernel itself is affected, a loss of heartbeat triggers a reboot. To minimise the impact on other tasks, it's preferable that only one user application operates on each node at a time, but of course, this is at the user's discretion.

Applications and Algorithms

So, what is the PPU supposed to do after launch? Even though the hardware costs are almost insignificant with respect to the overall satellite budget, with a launch price of approximately 10,000 US$/kg, each gramme has to be strongly justified. Right now, the most essential PPU task is image compression using a content-driven JPEG2000 scheme. But the major advantage of the PPU is its “standing watch” capability, in which the camera continuously monitors the Earth with image data evaluated and discarded immediately if it's not valuable. When valuable information is detected, a decision that is under software control, the obtained scene is kept for subsequent transmission. But even more important, X-Sat can transmit the results of its findings instantaneously to mobile terminals on the ground, each the size and price of a conventional transistor radio. The implications of such a concept are understood easily if one imagines that such a system had been in place when the earthquake northwest of Sumatra, Indonesia, created a tsunami wave killing more than 285,000 people on Boxing Day 2004.

Currently, two specific applications are supported: the detection of oil spills and the observation of haze originating from man-made and natural fires. Both make use of the additional processing power available through the FPGAs to pre-process image data streamed into the individual processors. The images in Figure 3 are examples from a simulated acquisition campaign over a complete daylight period of one day's orbits. The raw data from a 10% duty cycle covers an area of approximately 3 million km². If only 0.001% of this data showed oil spills, this would be equivalent to 62 catastrophic Prestige oil spills. With a fully functional PPU, the processing time for simultaneous execution of both disaster-detection tasks is 25% of the total daily orbit time. In return, however, it allows the evaluation of the entire data set rather than only the small subset that could be evaluated on the ground.

Figure 3. The processing power of the PPU makes it possible to detect oil spills and fires on the satellite without having to download all the raw data.

Launch into a New Space Era

From an engineer's perspective, X-Sat and its PPU couldn't succeed without Linux: almost all current application developments in the area of remote sensing use Linux, as do most modern cluster systems. So, sometime in early 2007, if you tilt your head back at the right time, you might be caught on camera, processed and downloaded, thanks to Linux.

Resources for this article: /article/8399.

Ian's been using Linux since about 1856 and weaned his kids at the penguin's electronic teat. His interests include satellites and signal processing, and his career objective is to lose his job and become a missionary in China.

Although Timo didn't try to wean his daughter on the penguin, he uses Linux for most of his number-crunching problems on Beowulf clusters and in the future even more extensively in space. Timo's research focus is remote sensing and various image-processing problems, well, unless he's gone traveling.

Bharath designs high-performance systems for Hewlett-Packard. Not being very high-performance himself, he relies on the pet monkey under his desk to come up with hardware designs. Occasionally, it also writes articles for magazines with penguins on their cover.
