First Beowulf Cluster in Space
The PPU consists of two anti-fuse Actel field programmable gate arrays (FPGAs), known to be more radiation-resistant than other solutions from Xilinx or Altera. Each FPGA hosts ten processing nodes (PNs), each with a 206MHz StrongARM processor and 64MB of SDRAM. Individual FPGAs are connected to three Atmel 4MB serial Flash chips containing a bootloader, the OS kernel and filesystem images, which include selected image-processing applications. Of course, programs can be added dynamically while the satellite is in space, as though it were a regular Linux cluster.
The PPU is connected to the rest of the satellite by fairly slow quad-redundant controller area network (CAN) links and two fast (200Mb/s) low-voltage differential signalling (LVDS) links for image data from the on-board camera. Figure 2 shows an overview of the hardware architecture. Most interestingly, the PPU also can take over satellite control from the OBC. In fact, this is one of the experiments intended to validate that COTS software and hardware components can fulfill mission-critical tasks.
Internally, the PPU resembles a cluster-based computing system with the FPGAs providing the interconnection network. In fact, these hubs themselves can offer image-processing capabilities. The cluster concept means we can sacrifice PNs to failure and yet carry on system operation regardless. It also gives each PN sufficient autonomy to run multiple algorithms simultaneously. As each FPGA has its own independent communication links, PPU operation can continue even with severe failures, such as destruction of an entire FPGA.
A parallel bus interfaces each PN to an FPGA. Given that ten PNs communicate with one FPGA, hardware I/O pins on the FPGA become a limitation. It is impossible to support ten full 32-bit buses. A 16-bit data bus is the next logical choice but results in a halving of the effective bus bandwidth. However, considerable effort was made to ensure that this slimmed interface operates efficiently, and it has resulted in a novel 17-bit data bus, which is discussed later. From the PN perspective, the FPGA is memory mapped into address space using an addressable window concept to reduce parallel bus requirements.
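The windowing idea can be illustrated with a small behavioural model. The article does not give the window size, the register layout or the size of the FPGA address space, so every number and name below is a hypothetical chosen for illustration. The sketch only shows how a window-select register lets a PN reach a larger address space through a small number of address lines.

```python
# Behavioural model of a windowed memory map.
# All sizes and names are illustrative assumptions, not values from the PPU.
WINDOW_BITS = 8                      # assumed: 8 address lines reach the FPGA
WINDOW_SIZE = 1 << WINDOW_BITS       # 256 locations visible at a time


class WindowedFpga:
    """FPGA address space seen by a PN through one movable window."""

    def __init__(self, total_words):
        if total_words % WINDOW_SIZE:
            raise ValueError("total_words must be a multiple of the window size")
        self.mem = [0] * total_words
        self.window = 0              # hypothetical window-select register

    def select_window(self, index):
        """Move the window; one extra write instead of extra address lines."""
        if not 0 <= index < len(self.mem) // WINDOW_SIZE:
            raise ValueError("window outside FPGA address space")
        self.window = index

    def _resolve(self, offset):
        if not 0 <= offset < WINDOW_SIZE:
            raise ValueError("offset outside window")
        return self.window * WINDOW_SIZE + offset

    def read(self, offset):
        return self.mem[self._resolve(offset)]

    def write(self, offset, value):
        self.mem[self._resolve(offset)] = value & 0xFFFF   # 16-bit data path


def address_lines_needed(total_words):
    """Address lines needed to reach total_words locations without windowing."""
    return max(1, (total_words - 1).bit_length())
```

In this model a flat map of 4,096 locations would need 12 address lines per PN, whereas the windowed map needs 8 plus an occasional window-select write.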
Booting of the PNs is sequential to reduce peak power on startup and consists of three stages. First, the StrongARM operates in 16-bit access mode, executing a tiny bootloader written in ARM assembler directly from the lowest address window of the FPGA. Although this translates into half-bandwidth memory access, the small size of the bootloader (512 bytes) makes it acceptable. It initialises the StrongARM, sets up SDRAM and then loads the second stage from serial Flash. The second stage retrieves the kernel and ramdisk from serial Flash and hands control to the kernel decompressor. Finally, the third stage consists of the bzImage, which decompresses itself into the appropriate memory location and then executes the kernel, which in turn decompresses its ext2 initrd ramdisk.
All communication to the PN occurs through the FPGA. A kernel device driver plus a user-space library provide a standard interface API for Linux applications. The low-level driver maintains two character devices in the filesystem that implement interrupt handling and software receive/transmit buffers. In order to keep the driver efficient and simple, kernel preemption was disabled. The driver also periodically writes to a watchdog register in the FPGA as a heartbeat signal; if the writes stop, the watchdog times out and triggers a reboot.
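The heartbeat mechanism can be sketched as a countdown that the driver must reload before it expires. This is a minimal simulation, not the PPU driver: the class, the tick-based timing and the timeout values are all assumptions made for illustration.

```python
class Watchdog:
    """Model of a watchdog register: it must be written before the timeout."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks   # assumed timeout, in abstract ticks
        self.remaining = timeout_ticks
        self.reboots = 0

    def kick(self):
        """Heartbeat write from the driver: reload the countdown."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Advance one time unit; on expiry, count a reboot and reload."""
        self.remaining -= 1
        if self.remaining <= 0:
            self.reboots += 1
            self.remaining = self.timeout_ticks


def run(wd, total_ticks, kick_every=None):
    """Simulate total_ticks; the driver kicks every kick_every ticks.

    kick_every=None models a hung driver that never writes the heartbeat.
    """
    for t in range(1, total_ticks + 1):
        if kick_every and t % kick_every == 0:
            wd.kick()
        wd.tick()
    return wd.reboots
```

A driver that kicks more often than the timeout never triggers a reboot in this model, while a hung driver is rebooted once per timeout period.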
In the PPU, writes from the PN to FPGA fall into two classes: control and message data. Message data normally is destined for another PN, whereas control data directs some action on the part of either the FPGA or PN. Similarly, reads of the FPGA by a PN also fall into these categories.
In case of message data writes from a PN to another PN via the FPGA, each item of data destined for a particular PN must be addressed. Either addressing information is part of each and every word transferred or it's set in advance. In the PPU, message paths are set in advance under PN control for efficiency reasons, assuming most transfers are large—which is without doubt the case for satellite images. But a 16-bit interface conveying 16-bit data messages must have a mechanism to distinguish between data and address packets. This could be achieved by writing these to separate address registers in the FPGA.
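The trade-off between the two addressing schemes can be sketched with a toy model of the hub. The class, the register names and the write counting are hypothetical; the model only illustrates why setting the path once is cheaper for large transfers than addressing every half-word.

```python
class Hub:
    """Toy FPGA hub: a control write sets the path, data writes follow it."""

    def __init__(self, num_pns):
        self.inbox = [[] for _ in range(num_pns)]   # one receive queue per PN
        self.dest = [None] * num_pns                # current path per source PN
        self.bus_writes = 0                         # writes seen on the PN buses

    def write_dest(self, src, dst):
        """Write to the (hypothetical) destination register for PN src."""
        self.dest[src] = dst
        self.bus_writes += 1

    def write_data(self, src, halfword):
        """Write to the (hypothetical) data register; forwarded along the path."""
        if self.dest[src] is None:
            raise RuntimeError("no path set for source PN")
        self.inbox[self.dest[src]].append(halfword & 0xFFFF)
        self.bus_writes += 1


def send(hub, src, dst, halfwords):
    """Send a message with the path set once, in advance."""
    hub.write_dest(src, dst)
    for hw in halfwords:
        hub.write_data(src, hw)


def writes_if_addressed_per_word(n_halfwords):
    """Bus writes if every half-word needed its own address write."""
    return 2 * n_halfwords
```

For a message of n half-words the preset path costs n + 1 writes in this model, against 2n when each half-word is addressed individually, so the overhead of the preset path becomes negligible as messages grow.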
The situation for reading the FPGA is trickier, however. A 16-bit bus requires two reads for each message: one read to determine message type and/or length and another to convey the actual message. But because our messages have variable length, there is an immediate problem concerning the timing of such messages. The reason is that an interrupt signal is used to indicate that a 16-bit value is waiting to be read, and the PNs are obliged to respond. So for long messages, the PN would read a sequence of 16-bit half-words, but it has no obvious means of distinguishing a 16-bit control word inserted into this sequence. We could prefix all half-words with a type header, but that would mean two reads per half-word of message, halving the bandwidth.
Our solution is a 17-bit bus with the StrongARMs operating in 32-bit access mode. Both raw data and commands share the physical link as half-words with their type differentiated by the state of a special 17th bit that indicates to the PN whether an incoming item is data or a control message. Most important, it does this without requiring any extra read cycles or extra bus bandwidth.
This approach wouldn't be of interest if the driver module couldn't take direct advantage of the load-store nature of the ARM and the fact that all instructions are conditional. The former implies that the 32-bit read from the 17-bit bus is loaded into an internal register before being moved to memory. The latter implies that if the 17th bit of the interface is wired to the most significant data bit, D31, rather than the more obvious choice of D16, it appears as the sign bit of the loaded word and can be used to affect the negative (N) flag. As a result, the destination of the data, one of two memory buffers, can be selected through conditional stores. This is far more efficient than the conditional branch that most other processors would need. The following assembler code provides an example with r0, r1, r2 and r3 being the registers for the address of the FPGA data transfer, the control word buffer, the message word buffer and the type, respectively. In summary, the code for the optimised solution is 33% faster and uses one register less:
SCENARIO I—Default Read Method:
    ...
    LDR   r4, [r0]          ; Load FPGA value
    LDR   r5, [r3]          ; Load type register
    TST   r5, #0x80000000   ; Check for D31
    STREQ r4, [r1]          ; Z flag set (control)
    STRNE r4, [r2]          ; Z flag not set (message)
    B     _repeat           ; Loop again
SCENARIO II—Optimised Read Method:
    ...
    LDR   r4, [r0]          ; Load FPGA value
    CMP   r4, #0            ; Set N from D31 (LDR itself leaves the flags unchanged)
    STRMI r4, [r1]          ; N flag set (control)
    STRPL r4, [r2]          ; N flag not set (message)
    B     _repeat           ; Loop again
    ...
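The tagging scheme and the receive loop can also be expressed as a short behavioural model. This is a sketch under stated assumptions rather than the driver's code: the function names are hypothetical, the payload is assumed to sit in D0 to D15, and, following the comments in Scenario II, a set D31 is taken to mark a control word.

```python
TAG_BIT = 1 << 31          # 17th bus line wired to D31, as described in the text
PAYLOAD_MASK = 0xFFFF      # assumed: D0..D15 carry the 16-bit half-word


def pack(halfword, is_control):
    """Form the 32-bit value a PN would read for one 17-bit bus item."""
    word = halfword & PAYLOAD_MASK
    return word | TAG_BIT if is_control else word


def unpack(word):
    """Return (halfword, tag_set) for one 32-bit read."""
    return word & PAYLOAD_MASK, bool(word & TAG_BIT)


def demux(words):
    """Split a read stream into control and message half-words.

    Each item costs exactly one read; the tag travels with the data,
    so no separate type read is needed.
    """
    control, message = [], []
    for word in words:
        halfword, tagged = unpack(word)
        (control if tagged else message).append(halfword)
    return control, message
```

For example, a control word 0x00A5 inserted between two message half-words is read as 0x800000A5 and lands in the control buffer, while the message sequence stays intact and in order.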