First Beowulf Cluster in Space

When a satellite's image-gathering power exceeds the bandwidth available to transmit the images, a Linux cluster right on the satellite helps decide which images to send back to Earth.
Software Error Detection and Correction

All satellites are subject to cosmic ray irradiation. Besides aging effects, the most frequent consequence is random bit-flip errors in SDRAM and the CPU. Left unchecked, these ultimately lead to large-scale data corruption. From a software perspective, the result of every calculation as well as every word in memory is suspect. It goes without saying that a mechanism to detect and correct such errors must be implemented in any space-based system.

Typical solutions for error detection and correction (EDAC) protection involve custom hardware checksum generators. But for our 20-processor PPU, a checksum solution is overly complex, so we utilise a less efficient but simpler multilayer software approach. An EDAC process is scheduled periodically in kernel space to provide error protection. A second EDAC process runs alongside it, so that the two can cross-check each other for redundancy.
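The article does not say how the EDAC processes detect or repair corruption. As a minimal sketch of what one software scrub pass could look like, the following assumes each protected region keeps a shadow copy plus a checksum recorded when the data was last known to be good. The names edac_region, edac_seal and edac_scrub, and the Fletcher-style checksum, are illustrative assumptions and not taken from the X-Sat code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical protected region: live data, a shadow copy and a checksum. */
typedef struct {
    uint8_t *live;    /* data used by the application                    */
    uint8_t *shadow;  /* redundant copy, same length                     */
    size_t   len;
    uint32_t sum;     /* checksum recorded when the data was known good  */
} edac_region;

/* Fletcher-style checksum over bytes; a simple stand-in for a real code. */
uint32_t edac_sum(const uint8_t *p, size_t n)
{
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < n; i++) {
        a = (a + p[i]) % 65535u;
        b = (b + a) % 65535u;
    }
    return (b << 16) | a;
}

/* Record the current contents as the known-good state. */
void edac_seal(edac_region *r)
{
    memcpy(r->shadow, r->live, r->len);
    r->sum = edac_sum(r->live, r->len);
}

/* One scrub pass.
 * Returns 0 if both copies are clean, 1 if the live copy was repaired
 * from the shadow, 2 if the shadow was repaired from the live copy,
 * and -1 if neither copy matches the stored checksum. */
int edac_scrub(edac_region *r)
{
    int live_ok   = edac_sum(r->live, r->len) == r->sum;
    int shadow_ok = edac_sum(r->shadow, r->len) == r->sum;

    if (live_ok && shadow_ok)
        return 0;
    if (shadow_ok) {
        memcpy(r->live, r->shadow, r->len);
        return 1;
    }
    if (live_ok) {
        memcpy(r->shadow, r->live, r->len);
        return 2;
    }
    return -1;
}
```

A periodic task would call edac_scrub() on every registered region. In this sketch a bit flip in the stored checksum itself is reported as unrecoverable, which is one reason a second, cross-checking process is attractive.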

Process integrity verification in our system is performed for crucial code between scheduled runs of the EDAC processes. In addition, input and output values of protected software procedures are monitored. If unexpected values are detected, the system either employs a clean-up approach, retries the calculation, outputs a previously calculated value or uses the most significant bit-flip correction scheme. Which of these to use is configured on a per-function basis in a parametric verification table, which is itself EDAC-protected.
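A per-function verification table of this kind could be sketched as follows. Everything here is an assumption made for illustration: the table layout, the function names in it, the use of a simple accepted range as the "expected value" test, the reading of the clean-up approach as a reset to a safe default, and the reading of the bit-flip scheme as "flip single bits, most significant first, until the value is plausible again". EDAC protection of the table itself is not modelled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Recovery policies named in the text; the enum itself is hypothetical. */
typedef enum { EDAC_CLEANUP, EDAC_RETRY, EDAC_LAST_GOOD, EDAC_MSB_FIX } edac_policy;
typedef enum { EDAC_OK, EDAC_DO_RETRY, EDAC_SUBSTITUTED } edac_action;

typedef struct {
    const char *func;      /* name of the protected function        */
    uint32_t    lo, hi;    /* accepted output range (assumed check) */
    edac_policy policy;    /* what to do with an unexpected value   */
    uint32_t    last_good; /* most recent accepted output           */
} edac_rule;

/* Illustrative entries only. */
edac_rule edac_table[] = {
    { "calc", 0, 1000, EDAC_LAST_GOOD, 0 },
    { "gain", 0, 255,  EDAC_MSB_FIX,   0 },
    { "mean", 10, 20,  EDAC_RETRY,    10 },
    { "mode", 1, 4,    EDAC_CLEANUP,   1 },
};

edac_rule *edac_lookup(const char *func)
{
    for (size_t i = 0; i < sizeof edac_table / sizeof edac_table[0]; i++)
        if (strcmp(edac_table[i].func, func) == 0)
            return &edac_table[i];
    return NULL;
}

/* Check an output value against its rule and apply the configured policy.
 * On EDAC_SUBSTITUTED, *value has been replaced; on EDAC_DO_RETRY the
 * caller is expected to recompute. */
edac_action edac_verify_output(edac_rule *r, uint32_t *value)
{
    if (*value >= r->lo && *value <= r->hi) {
        r->last_good = *value;
        return EDAC_OK;
    }
    switch (r->policy) {
    case EDAC_RETRY:
        return EDAC_DO_RETRY;
    case EDAC_LAST_GOOD:
        *value = r->last_good;
        return EDAC_SUBSTITUTED;
    case EDAC_MSB_FIX:
        /* Try undoing a single flipped bit, most significant first. */
        for (int bit = 31; bit >= 0; bit--) {
            uint32_t c = *value ^ (UINT32_C(1) << bit);
            if (c >= r->lo && c <= r->hi) {
                *value = c;
                return EDAC_SUBSTITUTED;
            }
        }
        /* fall through: no single-bit repair fits, reset instead */
    case EDAC_CLEANUP:
    default:
        *value = r->lo; /* assumed safe default */
        return EDAC_SUBSTITUTED;
    }
}
```

Keeping the policy in data rather than in code means a function's recovery behaviour can be tuned without touching the function itself, which matches the per-function configuration the text describes.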

C code is protected through a single header file and linkable library code. The function entry definition is inserted manually:


#define EDAC_CHECK \
    entry_check_edac(__func__);

GCC resolves __func__ at compile time to a string holding the name of the function being entered. The on-demand EDAC process is thus invoked before the body of the function executes. A redefinition of return is similar:


#define return(z) \
    return_check_edac(__func__, \
        __builtin_return_address(0), z); \
    return(z);

The developer inserts this into the code, as in the sample program given below:

int calc(int x, int y) {
    EDAC_CHECK
    .....
    return(z);
}
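To see the two macros working end to end, here is a small sketch with stub hooks. The bodies of entry_check_edac and return_check_edac, the counters and the multiplication inside calc are placeholders invented for this example; the real library code is not shown in the article. The sketch relies on GCC (or Clang) for __builtin_return_address, and the standard headers are included before return is redefined.

```c
#include <stdio.h>

/* Counters so the effect of the hooks can be observed (demo only). */
int edac_entries;
int edac_returns;

/* Stub hooks standing in for the real library routines. */
void entry_check_edac(const char *func)
{
    edac_entries++;
    fprintf(stderr, "enter %s\n", func);
}

void return_check_edac(const char *func, void *caller, int value)
{
    edac_returns++;
    fprintf(stderr, "leave %s (caller %p) -> %d\n", func, caller, value);
}

#define EDAC_CHECK \
    entry_check_edac(__func__);

/* A macro is not re-expanded inside its own replacement list, so the
 * trailing return is the ordinary C statement. Note that z is
 * evaluated twice. */
#define return(z) \
    return_check_edac(__func__, \
        __builtin_return_address(0), (z)); \
    return (z);

int calc(int x, int y)
{
    EDAC_CHECK
    int z = x * y; /* placeholder for the protected calculation */
    return(z);
}

/* Restore the plain keyword for unprotected code that follows. */
#undef return
```

Calling calc(6, 7) yields 42 and runs each hook once. Because the return macro expands to two statements, it should only be used inside braces; an unbraced `if (cond) return(z);` would return unconditionally.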

With these checks in place, a malfunctioning program can't cause too much damage. And even if the kernel itself is affected, a loss of heartbeat triggers a reboot. To minimise the impact on other tasks, it's preferable that only one user application operates on each node at a time, but of course, this is at the user's discretion.
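The heartbeat mechanism is not described further, but the idea can be sketched as a table of last-seen timestamps that a supervisor polls. The node count of 20 follows the 20-processor PPU mentioned earlier; the timeout value, the millisecond tick and all names below are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define HB_NODES      20   /* one entry per processor (assumed mapping) */
#define HB_TIMEOUT_MS 500  /* assumed timeout                           */

typedef struct {
    uint32_t last_beat_ms[HB_NODES]; /* time of each node's last heartbeat */
} hb_monitor;

/* Treat every node as alive at start-up. */
void hb_init(hb_monitor *m, uint32_t now_ms)
{
    for (int i = 0; i < HB_NODES; i++)
        m->last_beat_ms[i] = now_ms;
}

/* Called whenever a heartbeat arrives from a node. */
void hb_beat(hb_monitor *m, int node, uint32_t now_ms)
{
    m->last_beat_ms[node] = now_ms;
}

/* True if the node has been silent for longer than the timeout.
 * Unsigned subtraction keeps this correct across timer wrap-around. */
bool hb_needs_reboot(const hb_monitor *m, int node, uint32_t now_ms)
{
    return (uint32_t)(now_ms - m->last_beat_ms[node]) > HB_TIMEOUT_MS;
}
```

A supervisor would call hb_needs_reboot() for each node on every tick and power-cycle or reset any node for which it returns true.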

Applications and Algorithms

So, what is the PPU supposed to do after launch? Even though the hardware costs are almost insignificant with respect to the overall satellite budget, with a launch price of approximately US$10,000/kg, each gramme has to be strongly justified. Right now, the most essential PPU task is image compression using a content-driven JPEG2000 scheme. But the major advantage of the PPU is its “standing watch” capability, in which the camera continuously monitors the Earth and the image data is evaluated immediately and discarded if it's not valuable. When valuable information is detected, a decision that is under software control, the scene is kept for subsequent transmission. Even more important, X-Sat can transmit the results of its findings instantaneously to mobile terminals on the ground, each the size and price of a conventional transistor radio. The implications of such a concept are easy to understand if one imagines, for example, that such a system had been in place when the earthquake northwest of Sumatra, Indonesia, created a tsunami wave killing more than 285,000 people on Boxing Day 2004.

Currently, two specific applications are supported: the detection of oil spills and the observation of haze originating from man-made and natural fires. Both make use of the additional processing power available through the FPGAs to pre-process image data streamed into the individual processors. The images in Figure 3 are examples from a simulated acquisition campaign over a complete daylight period of one day's orbits. The raw data from a 10% duty cycle covers an area of approximately 3 million km². If only 0.001% of this data showed oil spills, this would be equivalent to 62 catastrophic Prestige oil spills. With a fully functional PPU, the processing time for simultaneous execution of both disaster-detection tasks is 25% of the total daily orbit time. In return, however, it allows the evaluation of the entire data set rather than of only a small subset on the ground.

Figure 3. The processing power of the PPU makes it possible to detect oil spills and fires on the satellite without having to download all the raw data.
