NEC Fault-Tolerant Linux Server

NEC Corporation's Express5800/320La is the first commercially available general-purpose server offering hardware fault tolerance for Linux. Intended for standalone use or as an element in a high-availability cluster, this server features redundant CPUs, memory, disk, I/O and power. Hardware failover circuitry allows normal operation to continue despite loss of any single unit. Hot-swap capability extends beyond the usual power supply and disk. If a CPU, RAM or I/O card fails on this system, it is isolated and processing continues without interruption. You may replace the failed item at your convenience, without taking the entire system down. This could provide significant cost savings, for example, to a company needing servers that are always up, at far-flung locations where technical support might be hours away. Applications require no high-availability modifications to use this system as a standalone server, nor do they require failover scripts and planning.

Thousands of these servers have shipped with other operating systems, and now Linux is available on them. A stock Linux kernel provides too little error detection and recovery for this mode of operation, so NEC has added extensive hardening. SCSI, Ethernet and Fibre Channel drivers and support code in particular are modified to provide fault detection and failover. NEC's currently shipping kernel is based on version 2.4.2, with backports of some later changes. At the time of this writing, NEC was reviewing and documenting its kernel changes for a planned public release, perhaps through OSDL's Carrier Grade Linux Project. NEC is a founding member and a sponsor of OSDL.


The Express5800/320La has four Pentium III 800MHz processors arranged in pairs together with RAM and other circuitry, in two hot-swappable CPU modules. Both modules run the same instructions in lockstep, checking each other's outputs. A failed unit is isolated almost instantly, allowing processing to continue with no observable interruption. Monitoring software keeps a tally of recoverable failures, such as ECC corrections to memory output, allowing diagnosis of certain incipient problems before they grow into larger failures. The stock filesystem on this server is ext2.
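To illustrate the idea of tallying recoverable failures, here is a minimal Python sketch. It is not NEC's monitoring software; the class name, the component labels and the warning threshold are all hypothetical, chosen only to show how a running count of corrected errors could point at a part that is starting to fail.

```python
from collections import Counter


class RecoverableErrorTally:
    """Toy tally of corrected (recoverable) errors per component.

    Hypothetical illustration only; not NEC's monitoring code.
    """

    def __init__(self, warn_threshold: int):
        # Assumed policy: a component becomes a suspect once it has
        # logged at least this many corrected errors.
        self.warn_threshold = warn_threshold
        self.counts = Counter()

    def record(self, component: str) -> None:
        """Note one corrected error (for example, an ECC fix) on a component."""
        self.counts[component] += 1

    def suspects(self) -> list:
        """Return components at or above the threshold, sorted by name."""
        return sorted(
            name for name, n in self.counts.items() if n >= self.warn_threshold
        )


tally = RecoverableErrorTally(warn_threshold=3)
for _ in range(3):
    tally.record("cpu-module-1/dimm-2")  # hypothetical component label
tally.record("cpu-module-2/dimm-0")      # a single correction is not alarming
print(tally.suspects())
```

A real monitor would presumably also weigh how quickly the corrections accumulate, which this sketch ignores.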

Up to three pairs of internal 18, 36 or 73GB drives may be installed and configured as RAID-1 mirrors, providing up to 219GB of internal storage. An NEC S1200 RAID array may be connected through a redundant Fibre Channel connection, providing up to 2TB of additional fault-tolerant storage.
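The 219GB figure follows from how RAID-1 works: each mirrored pair holds one copy of the data on each of its two drives, so a pair contributes the capacity of a single drive. The short Python sketch below works through the arithmetic for the three drive sizes; the function names are my own.

```python
def raid1_usable_gb(pairs: int, drive_gb: int) -> int:
    """Usable capacity of `pairs` RAID-1 mirrors built from `drive_gb` drives."""
    # A mirror stores the same data on both drives, so each pair
    # yields the capacity of one drive.
    return pairs * drive_gb


def raw_gb(pairs: int, drive_gb: int) -> int:
    """Total capacity of all installed drives, ignoring mirroring."""
    return 2 * pairs * drive_gb


for size in (18, 36, 73):
    print(f"{size}GB drives: {raw_gb(3, size)}GB raw, "
          f"{raid1_usable_gb(3, size)}GB usable")
```

With three pairs of 73GB drives, six drives hold 438GB raw, half of which is usable.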

Two PCI modules feature dual identical sets of PCI cards. The base unit has one Ethernet card in each module. Both cards are connected to the same network; when one fails, the other takes over using the same MAC and IP addresses. All modules and power supplies plug in to a passive backplane.
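The Ethernet takeover described above resembles an active-backup arrangement: one card carries traffic while the other stands by, and the MAC and IP addresses stay the same after a switch, so peers on the network see no change of identity. The following Python sketch simulates that behavior under my own assumptions. The class names, the card labels and the addresses (taken from documentation and locally administered ranges) are hypothetical, and NEC's hardened drivers certainly do more than this.

```python
from dataclasses import dataclass


@dataclass
class Nic:
    """One Ethernet card; link_up models whether its cable has link."""
    name: str
    link_up: bool = True


class ActiveBackupPair:
    """Toy model of two cards sharing a single MAC and IP address."""

    def __init__(self, primary: Nic, backup: Nic, mac: str, ip: str):
        self.nics = [primary, backup]
        self.mac = mac          # shared by whichever card is active
        self.ip = ip            # likewise unchanged across a failover
        self.active = primary

    def poll(self):
        """Return the card carrying traffic, switching if the active one failed.

        Returns None when neither card has link.
        """
        if not self.active.link_up:
            for nic in self.nics:
                if nic.link_up:
                    self.active = nic
                    break
            else:
                return None  # both links are down
        return self.active


pair = ActiveBackupPair(Nic("pci-module-1"), Nic("pci-module-2"),
                        mac="02:00:00:00:00:01", ip="192.0.2.10")
pair.nics[0].link_up = False          # simulate unplugging the first cable
print(pair.poll().name, pair.mac, pair.ip)
```

Because the addresses belong to the pair rather than to an individual card, a failover requires no ARP or routing changes elsewhere on the network in this simplified model.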

Hardware watchdog timers look for system failure—for example, a system lockup due to kernel panic—and may be configured to initiate an automatic reboot either to full run mode or to diagnostic mode.
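The watchdog principle is simple enough to sketch in a few lines of Python. A healthy system resets ("kicks") the timer at regular intervals; if the kicks stop, as they would after a kernel panic, the timer expires and triggers the configured action. This is a software simulation for illustration only. The class, the action names and the 30-second timeout are my assumptions, not details of NEC's hardware.

```python
class Watchdog:
    """Toy watchdog timer driven by caller-supplied timestamps (seconds)."""

    ACTIONS = ("reboot-run", "reboot-diagnostic")  # hypothetical action names

    def __init__(self, timeout_s: float, on_expire: str = "reboot-run"):
        if on_expire not in self.ACTIONS:
            raise ValueError(f"unknown action: {on_expire}")
        self.timeout_s = timeout_s
        self.on_expire = on_expire
        self.last_kick = 0.0

    def kick(self, now: float) -> None:
        """Called periodically by a healthy system to reset the timer."""
        self.last_kick = now

    def check(self, now: float):
        """Return the configured action if the timer has expired, else None."""
        if now - self.last_kick > self.timeout_s:
            return self.on_expire
        return None


wd = Watchdog(timeout_s=30.0, on_expire="reboot-diagnostic")
wd.kick(now=0.0)
print(wd.check(now=10.0))   # still within the timeout
print(wd.check(now=45.0))   # no kick for 45 seconds: the timer has expired
```

Passing the time in explicitly keeps the sketch deterministic; real watchdog hardware counts down on its own and cannot be stalled by the software it monitors.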

This server is large, measuring 14" wide by 21.5" high by 27.5" deep and weighing about 150 pounds. An 8U rackmount version also is available. A three-year warranty is included. Telephone support is provided by NEC during regular business hours.

Unpacking and Startup

Unpacking our review unit's well-traveled shipping crate, I observed a warning sticker on the case saying “Exercise caution when handling the system to avoid personal injuries.” NEC isn't kidding. The help of a strong coworker was needed to lift this thing gently out of its shipping crate and place it on the floor. Our demo unit had dual Seagate ST318404LC 18GB SCSI drives, 1GB of RAM and two Ethernet cards.

Internal assemblies look to be well made, with no tools required for removal and replacement. Better labeling of the units would be nice, though. Fans are located in the removable units, so you don't have to take one of these servers down to replace a failing fan. Even the power cords are redundant. This allows powering the server from two independent power sources, not to mention letting the harried system administrator unplug a cord to untangle it without interrupting anything.

After I pressed the power switch, located under a hinged plastic protective lid, a chorus of cooling fans kicked in with a hearty whoosh measuring 63 dBA at the front panel and 74 dBA at the back. The front panel LCD status monitor showed diagnostic messages, and LEDs flashed. After about two minutes, the system completed a power-on self-test and booted up into NEC Linux, which is based on Red Hat Linux 7.1.

The popular bonnie++ disk test program was the first thing we tried on this system. Immediately upon bonnie++ startup, the fault light on one CPU module came on. The test completed, as expected, but it seemed prudent to correct the problem with the server. An NEC engineer, reached over the support line, had us run a few tests and then suggested that the passive backplane had suffered mechanical damage, possibly in shipping. The backplane isn't hot-swappable. He wanted to examine it, so we arranged an exchange of servers. The new server arrived in good time, booted up and survived bonnie++ quite nicely.

To test networking recovery, I unplugged the Ethernet cables from each of the two Ethernet cards, one at a time. Ping indicated a few packets were lost, but overall communication was maintained. An rsync between the test unit and another server completed without error, despite continual unplugging of alternate cables, one at a time, with several seconds of overlap while both were plugged in.

While running bonnie++, I disconnected power to each CPU module and then reconnected it. In each case the CPU module came back up after running diagnostics for a couple of minutes. The disk benchmark results were unaffected.