Highly Available Networking
High availability (HA) means different things to different people. This article defines availability as the percentage of time that a computer system is capable of providing its assigned service. A good availability figure for computer systems used for business-critical tasks, such as running a telephone switch or an enterprise data communication network, is 99.999% (five nines). This translates to less than six minutes per year during which the service is not available.
CompactPCI traditionally has been the platform of choice for these five nines systems, because hot swap of components into and out of a running system is usually also a requirement.
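The six-minute figure follows directly from the percentage. As a quick check, the sketch below converts an availability percentage into a yearly downtime budget; the function name and the use of an average 365.25-day year are assumptions made for this illustration.

```python
# Average year length in minutes (365.25 days accounts for leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label}: {downtime_minutes_per_year(pct):.2f} minutes/year")
```

For five nines this works out to roughly 5.26 minutes per year, which is where the "less than six minutes" figure comes from.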
In a highly available network there should be multiple independent paths to each system in the network to avoid single points of failure (SPOF). Physical separation is also a good idea because if both paths are in the same conduit, and the conduit gets cut by accident, the network will go down.
Key to availability is the ability to detect failure quickly and transparently switch from one LAN connection to another. Putting the burden of handling redundancy in the networking driver allows for easier HA hardening of networked applications, as it relieves the application of having to be aware of network topology.
The Linux bonding driver can detect link failure and reroute network traffic around a failed link in a manner transparent to the application. With certain network switches, it also can aggregate network traffic across all working links to achieve higher throughput. This is sometimes referred to as trunking.
The bonding driver accomplishes this by enslaving all of the Ethernet ports in the bond to the same Ethernet MAC address, which ensures the proper routing of packets across the links. With a hub arrangement, there should not be more than one link with the same MAC address active at any one time, so the bonding driver can be set up to have only one channel active at a time. This is called active-backup mode, and it will route all traffic through one channel until it detects a failure, at which point it switches to the next backup channel.
With a switch instead of a hub, it is possible to send traffic over all live links at the same time. This is called round-robin mode: packets are sent over all working links, with each successive packet going out over the next link in the bonding rotation, effectively aggregating the bandwidth of all usable links. Round-robin mode provides availability as well as aggregation, but not all switches are capable of supporting aggregation. The bonding documentation (see Resources) contains a list of some switches that do support aggregation.
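The difference between the two modes can be illustrated with a toy model. The sketch below is not the bonding driver's code; the Bond class and its method names are hypothetical and exist only to show how each mode chooses an outgoing link for a packet.

```python
class Bond:
    """Toy model of a bond that picks one slave link per outgoing packet."""

    def __init__(self, slaves, mode):
        if mode not in ("active-backup", "round-robin"):
            raise ValueError(f"unknown mode: {mode}")
        self.slaves = list(slaves)
        self.mode = mode
        self.link_up = {s: True for s in self.slaves}
        self.active = self.slaves[0]  # current channel in active-backup mode
        self._next = 0                # rotation counter for round-robin mode

    def set_link(self, slave, up):
        """Record a link-status change for one slave."""
        self.link_up[slave] = up

    def pick_slave(self):
        """Return the slave that carries the next packet, or None if all are down."""
        live = [s for s in self.slaves if self.link_up[s]]
        if not live:
            return None
        if self.mode == "active-backup":
            # Stay on the active channel until it fails, then move to a backup.
            if self.active not in live:
                self.active = live[0]
            return self.active
        # Round-robin: each packet goes out on the next working link.
        slave = live[self._next % len(live)]
        self._next += 1
        return slave


ab = Bond(["eth0", "eth1"], "active-backup")
rr = Bond(["eth0", "eth1"], "round-robin")
print([ab.pick_slave() for _ in range(3)])
print([rr.pick_slave() for _ in range(4)])
```

In this model, taking eth0 down with ab.set_link("eth0", False) moves all active-backup traffic to eth1, while the round-robin bond simply drops the failed link from its rotation.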
The program that creates the bond is the ifenslave program. It is similar in function to the ifconfig program that configures nonbonded Ethernet interfaces, except that it configures all members of the bond to the same network configuration (IP, MAC, broadcast addresses, etc.). To configure the bonding driver, use ifconfig to configure the bond0 device, and use ifenslave to configure the members of the bond (the slaves).
Many recent distributions, including the Hard Hat Linux HA Framework 2.0 release, come with bonding and ifenslave already included. For other systems, bonding is available as a patch that contains the bonding driver and the ifenslave program, as well as some other modifications necessary to make the whole package work properly. The driver can be compiled into the kernel or run as a module.
Listing 1 shows a typical configuration scenario. The first line installs the bonding driver as a module in active-backup mode with a link-status check period of 100ms. Round-robin mode would use a mode parameter of 0. The ifconfig command sets the IP address for the bond0 device. The two ifenslave commands that follow enslave eth0 and eth1 to the bond0 device. The bond0 device takes the MAC address of the first slave configured in the bond, and this becomes the MAC address for all devices in the bond.
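To make the sequence of steps concrete, the sketch below assembles the kind of command lines the paragraph describes. It is an illustration, not a reproduction of Listing 1: the helper function is hypothetical and the IP address is a placeholder. The mode and miimon module parameters are the bonding driver's mode selector and link-monitoring interval in milliseconds.

```python
# Mode numbers as described in the text: 0 is round-robin, 1 is active-backup.
MODES = {"round-robin": 0, "active-backup": 1}


def bonding_commands(mode, miimon_ms, ip_address, slaves, bond="bond0"):
    """Build the shell commands that bring up a bond and enslave its members."""
    commands = [
        # Load the driver with the chosen mode and link-check period.
        f"modprobe bonding mode={MODES[mode]} miimon={miimon_ms}",
        # Give the bond device its IP address and bring it up.
        f"ifconfig {bond} {ip_address} up",
    ]
    # Enslave each Ethernet port; the first one supplies the bond's MAC address.
    commands += [f"ifenslave {bond} {slave}" for slave in slaves]
    return commands


# 10.0.1.1 is a placeholder address chosen for this example.
for line in bonding_commands("active-backup", 100, "10.0.1.1", ["eth0", "eth1"]):
    print(line)
```

The function only builds strings; running the resulting commands requires root privileges and a kernel with the bonding driver available.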
The networking stack talks to the bond0 device, which sends packets out over whichever slave device is appropriate, given the mode and availability status. In Listing 1, the mode is active-backup, and the active Ethernet device is eth0. Inactive Ethernet slaves have NOARP in the status line.
When a component fails, it is not enough to detect and mask the failure. The failing component must be repaired so that the next failure does not cause loss of service. For an Ethernet cable or hub or switch, it is usually a simple matter of replacing it with a working one. For an Ethernet board in a running computer, it is not always so simple.
The PCI Industrial Computer Manufacturers Group (PICMG) has created a set of standards for CompactPCI hardware and software that make it easier to replace defective hardware in a running system. With PICMG-compliant hardware and the proper drivers and dæmons, replacing a defective board in a running system is a simple matter of removing the defective board and replacing it with a working one.
PICMG standard 2.1 is a hardware standard that covers the mechanical and electrical requirements necessary to remove and/or plug in a board in a running system (hot swap). PICMG standard 2.12 is a software standard that covers the driver requirements to handle hot-swap events. The SourceForge PICMG hot-swap site has the hot-swap driver routines and HA dæmon for handling hot swapping.
Hot swap requires additional coordination with drivers and the PCI subsystem to handle PCI devices that come and go. When an Ethernet card fails and the operator wants to remove it, all he or she has to do is open the handle switch on the CompactPCI board. This sends an ENUM# interrupt to the PICMG 2.12 driver, which calls the routine registered to receive hot-swap events. This routine is responsible for notifying the driver for the card, removing the device from the kernel PCI tree and turning on the blue hot-swap LED on the board, which indicates to the operator that it is safe to remove the card. It also notifies the HA dæmon so that it can take any user-space actions necessary (such as removing an Ethernet device from a bond or removing a driver that is no longer used).
When a replacement card is inserted, it also causes an ENUM# interrupt, which gets routed to the same routine mentioned above. This routine is then responsible for inserting the device in the kernel PCI tree and notifying the HA dæmon that a new device has been inserted.
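The removal and insertion sequences can be summarized in a small simulation. This sketch does not use the PICMG 2.12 driver's interfaces; every class, method and event name in it is hypothetical, and it models only the order of the steps described above.

```python
class HotSwapHandler:
    """Toy model of the routine registered to receive hot-swap events."""

    def __init__(self, notify_had):
        self.pci_tree = set()         # slots present in the simulated kernel PCI tree
        self.blue_led = {}            # slot -> True when the card is safe to pull
        self.notify_had = notify_had  # callback standing in for the HA daemon
        self.driver_notices = []      # record of driver notifications

    def on_enum(self, slot, handle_open):
        """Handle one simulated ENUM# interrupt for the given slot."""
        if handle_open and slot in self.pci_tree:
            # Extraction: tell the card's driver, drop the device from the
            # PCI tree, light the blue LED, then inform the HA daemon.
            self.driver_notices.append(slot)
            self.pci_tree.discard(slot)
            self.blue_led[slot] = True
            self.notify_had("removed", slot)
        elif not handle_open and slot not in self.pci_tree:
            # Insertion: add the device to the PCI tree and inform the HA daemon.
            self.pci_tree.add(slot)
            self.blue_led[slot] = False
            self.notify_had("inserted", slot)


events = []
handler = HotSwapHandler(lambda kind, slot: events.append((kind, slot)))
handler.on_enum(2, handle_open=False)  # card inserted in slot 2
handler.on_enum(2, handle_open=True)   # operator opens the handle switch
handler.on_enum(2, handle_open=False)  # replacement card inserted
print(events)
```

Routing both directions through one entry point mirrors the text, where the same routine receives the ENUM# interrupt for removal and for insertion.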
The HA Dæmon (HAD) is a user-space program that receives events from the hot-swap subsystem. It takes two configuration files, one to specify which devices are supported (and their corresponding drivers) and one to specify actions to take when a hot-swap event is received.
If the hot-swap subsystem receives an insert event and does not have a driver loaded for the card that was inserted, it sends a load-driver message to the HAD. The HAD checks its device-driver configuration file (/etc/pcidrivers.conf), and if it knows the driver for the card, it loads it. If the card is unknown, the HAD just ignores the insert event.
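The load-driver decision amounts to a table lookup. The sketch below assumes a table keyed by PCI vendor and device ID; the actual format of /etc/pcidrivers.conf is not described in this article, so the table layout, the function name and the sample entries are all illustrative assumptions.

```python
def handle_load_driver(device_id, known_drivers, load):
    """Load the driver for a newly inserted card, or ignore an unknown card.

    device_id     -- (vendor, device) tuple identifying the card
    known_drivers -- mapping from device IDs to driver module names
    load          -- callable that actually loads the named driver
    """
    driver = known_drivers.get(device_id)
    if driver is None:
        return None  # unknown card: the insert event is ignored
    load(driver)
    return driver


# Illustrative entries only; not taken from a real pcidrivers.conf.
KNOWN_DRIVERS = {
    (0x8086, 0x1229): "eepro100",
    (0x1011, 0x0019): "tulip",
}

loaded = []
print(handle_load_driver((0x8086, 0x1229), KNOWN_DRIVERS, loaded.append))
print(handle_load_driver((0xFFFF, 0x0001), KNOWN_DRIVERS, loaded.append))
```

Passing the loader in as a callable keeps the sketch free of side effects; a real daemon would invoke the system's module loader at that point.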
The HAD also has another major duty with regard to hot swap, and that is configuring the card that has just been inserted. For example, if the card is involved in networking, it needs to have its address established, or if it is a member of a bond, it needs to be enslaved.
The HAD's configuration file is /etc/had.conf, shown in Listing 2. This file is for a Motorola 8216 chassis with two I/O domains. The first two lines in Section 1 state that this processor is going to control both I/O domains. A chassis with only one I/O domain may skip this section.
The first line in Section 2 indicates that the HAD will start the bonding driver for bond0 and configure it with IP address 10.0.1.1.
The next two lines define Ethernet configurations that will be used by the ports in the boards described by Section 3.
Configuration config1 is an example of a nonbonded Ethernet configuration. It has four parameters: the IP address, network mask, network address and broadcast address.
Configuration config2 is an example of a bonding configuration that will enslave any board that uses it to the bond0 device configured by the bond command in Section 2.
The remainder of the had.conf file states which configurations are used by devices in the backplane. The first parameter of the device command is the slot, and the second is the subdevice. Thus, the card in slot 2 is a dual Ethernet card, and both Ethernet ports will be enslaved to bond0. The device in slot 12 is also a dual Ethernet card, and the device in slot 16 is a single Ethernet that will be configured with the nonbonded configuration specified by config1.
The Linux bonding driver can be an important component of a highly available system and, coupled with the hot-swap capability of CompactPCI hardware, is capable of providing networking with five nines of availability.
The bonding driver could use a number of improvements. It only detects link failure through the Ethernet link-status indicator and could use a mechanism to diagnose more subtle failures. The bonding driver also should be enhanced to tell monitoring software when it has detected a link failure and routed around it, so that a repair strategy can be implemented. But the beauty of Linux and open source is that you don't have to wait for someone else to do it; you can do it yourself.
John Mehaffey is the author of the PICMG 2.12 driver in SourceForge, as well as a participant in a number of PICMG working groups including the 2.12 and 2.13 standards. John works for MontaVista Software as a technical marketing engineer and is also the mayor of Saratoga, California, a city of 30,000 in Silicon Valley. Contact John at email@example.com.