Automating the Physical World with Linux, Part 3: Designing around System Failure

Bryce examines some of the causes of system failure and gives some tips on how to avoid it.
Life-Cycle Maintenance

Once a control system is installed and operational, the issue of life-cycle maintenance rears its head. As described earlier, a life-cycle failure essentially means that some part of the control system breaks. This is inevitable; at some point during the system's lifetime a module will burn out, or someone will accidentally cut a network cable, or a power failure will occur, or lightning will strike, or the controller will crash (yes, even embedded Linux)--the list is endless. I dislike using the word ``will'' so much, but failures are not a matter of if but when. Good system design practices lower the probability that the most typical failures occur, and good system designers try their best to design the system, choose the hardware and implement the design so the anticipated failure occurs infrequently or is due to unusual circumstances.

Figure 2. Enemies of Life-Cycle Maintenance


We've determined that no matter how well a control system is designed, a system failure will occur at some point. Redundancy, however, gives us a way to design around this. Redundancy means duplicating features with backups so that a backup unit takes over another unit's work when it fails. Sensors, I/O modules, I/O units, network cables, infrastructure and even controllers all can be duplicated.

Redundancy does not eliminate system failure, but it allows the control system to tolerate a failure and continue to operate--thus the term ``fault tolerance''. However, someone still must repair the damage to prevent an inevitable system failure. I state this because so many times in my career, a customer is told that a system is redundant and fault-tolerant, and then wonders why I give them a maintenance schedule. Or worse, they wonder why the redundant and fault-tolerant system completely failed after being ignored for three years.

Revisiting Distributed Control with Redundancy

In the second article in the series, I introduced the concept of distributed control where multiple control systems interact. With redundancy, a duplicate and redundant control system can monitor a primary control system but also can take over in the event of a system failure.

Backup systems can grow to be quite complicated rapidly, but this is how a simple backup system works. I have two redundant systems identical to each other. The main system is called the primary and the second system is the backup. There is a dedicated network link that connects the primary and backup controllers. Recall that a controller is typically the Linux computer (or computers) that runs the software-control algorithm. The primary controller ``holds'' the physical system and sends status updates to the backup controller. If there's an anomaly the primary system can detect, the primary controller sends an alert to the backup. Otherwise, the primary controller continues to send updates to the backup system. The backup control system watches the status updates being sent by the primary control system. Here are a few scenarios where the backup controller would come into play.

Scenario 1: Primary Controller Detects an Anomaly

The primary controller determines that a failure has occurred (network cable, infrastructure, power supply, I/O, sensor, etc.) and sends an alarm message to the backup. The primary controller also logs the failure to a report file. The backup receives the messages and brings its I/O system up. At the same time, it forces the primary system off-line to prevent the two controllers from competing on the system. A warning alarm sounds to alert personnel of the failure. The backup system holds the system until it is manually directed to release. Upon this release, the primary controller resumes control and the backup returns to an idle state.

Scenario 2: Primary Controller Fails

When the primary controller fails, it stops communication updates from being sent to the backup controller. The backup controller senses this by having a timeout occur while waiting for a status update from the primary. The backup controller takes over the system, sounds an alarm and logs its transfer of control so maintenance personnel can determine what caused the transfer.

I'd like to point out that if the link from the primary to secondary network fails, the power to the primary controller fails, or a component of the primary fails, a similar failure scenario results.