NEC Fault-Tolerant Linux Server

NEC Corporation's Express5800/320La is the first commercially available general-purpose server offering hardware fault tolerance for Linux. Intended for standalone use or as an element in a high-availability cluster, this server features redundant CPUs, memory, disk, I/O and power. Hardware failover circuitry allows normal operation to continue despite loss of any single unit. Hot-swap capability extends beyond the usual power supply and disk. If a CPU, RAM or I/O card fails on this system, it is isolated and processing continues without interruption. You may replace the failed item at your convenience, without taking the entire system down. This could provide significant cost savings, for example, to a company needing servers that are always up, at far-flung locations where technical support might be hours away. Applications require no high-availability modifications to use this system as a standalone server, nor do they require failover scripts and planning.

Thousands of these servers have shipped with other operating systems, and now Linux is available on them. A stock Linux kernel provides too little error detection and recovery for this mode of operation, so NEC has added extensive hardening. SCSI, Ethernet and Fibre Channel drivers and support code in particular are modified to provide fault detection and failover. NEC's currently shipping kernel is based on version 2.4.2, with backports of some later changes. At the time of this writing, NEC was reviewing and documenting its kernel changes for a planned public release, perhaps through OSDL's Carrier Grade Linux Project. NEC is a founding member and a sponsor of OSDL.

Features

The Express5800/320La has four Pentium III 800MHz processors arranged in pairs together with RAM and other circuitry, in two hot-swappable CPU modules. Both modules run the same instructions in lockstep, checking each other's outputs. A failed unit is isolated almost instantly, allowing processing to continue with no observable interruption. Monitoring software keeps tally of recoverable failures, such as ECC corrections to memory output, allowing diagnosis of certain incipient problems prior to larger failures. The stock filesystem on this server is ext2.

A total of three pairs of internal 18, 36 or 73GB drives may be installed and configured in RAID-1 pairs, providing up to 219GB of internal storage. An NEC S1200 RAID array may be connected through a redundant Fibre Channel, providing up to 2TB of additional fault-tolerant storage.

Two PCI modules feature dual identical sets of PCI cards. The base unit has one Ethernet card in each module. Both cards are connected to the same network; when one fails, the other takes over using the same MAC and IP addresses. All modules and power supplies plug in to a passive backplane.

Hardware watchdog timers look for system failure—for example, a system lockup due to kernel panic—and may be configured to initiate an automatic reboot either to full run mode or to diagnostic mode.

This server is large, measuring 14“ wide by 21.5” high by 27.5“ deep and weighing about 150 pounds. An 8U rackmount version also is available. A three-year warranty is included. Telephone support is provided by NEC during regular business hours.

Unpacking and Startup

Unpacking our review unit's well-traveled shipping crate, I observed a warning sticker on the case saying “Exercise caution when handling the system to avoid personal injuries.” NEC isn't kidding. The help of a strong coworker was needed to lift this thing gently out of its shipping crate and place it on the floor. Our demo unit had dual Seagate ST318404LC 18G SCSI drives, 1GB of RAM and two Ethernet cards.

Internal assemblies look to be well made, with no tools required for removal and replacement. Better labeling of the units would be nice, though. Fans are located in the removable units, so you don't have to take one of these servers down to replace a failing fan. Even the power cords are redundant. This allows powering the server from two independent power sources, not to mention letting the harried system administrator unplug a cord to untangle it without interrupting anything.

After pressing the power switch, located under a hinged plastic protective lid, a chorus of cooling fans kicked in with a hearty whoosh measuring 63 dBA at the front panel, 74 dBA at the back. The front panel LCD status monitor showed diagnostic messages and LEDs flashed. After about two minutes, the system completed a power-on self-test and booted up into NEC Linux, which is based on Red Hat Linux 7.1.

The popular bonnie++ disk test program was the first thing we tried on this system. Immediately upon bonnie++ startup, the fault light on one CPU module came on. The test completed, as expected, but it seemed prudent to correct the problem with the server. An NEC engineer reached over the support line had us run a few tests, and then suggested that the passive backplane had suffered mechanical damage, possibly in shipping. The backplane isn't hot-swappable. He wanted to examine it, so we arranged an exchange of servers. The new server arrived in good time, booted up and survived bonnie++ quite nicely.

To test networking recovery, I unplugged the Ethernet cables from each of the two Ethernet cards, one at a time. Ping indicated a few packets were lost, but overall communication was maintained. An rsync between the test unit and another server completed without error, despite continual unplugging of alternate cables, one at a time, with several seconds of overlap while both were plugged in.

While running bonnie++, I disconnected power to each CPU module and then reconnected it. In each case the CPU module came back up after running diagnostics for a couple of minutes. The disk benchmark results were unaffected.

______________________

White Paper
Linux Management with Red Hat Satellite: Measuring Business Impact and ROI

Linux has become a key foundation for supporting today's rapidly growing IT environments. Linux is being used to deploy business applications and databases, trading on its reputation as a low-cost operating environment. For many IT organizations, Linux is a mainstay for deploying Web servers and has evolved from handling basic file, print, and utility workloads to running mission-critical applications and databases, physically, virtually, and in the cloud. As Linux grows in importance in terms of value to the business, managing Linux environments to high standards of service quality — availability, security, and performance — becomes an essential requirement for business success.

Learn More

Sponsored by Red Hat

White Paper
Private PaaS for the Agile Enterprise

If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.

Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.

Learn More

Sponsored by ActiveState