Automating the Physical World with Linux, Part 3: Designing around System Failure

Bryce examines some of the causes of system failure and gives some tips on how to avoid it.
Life-Cycle Maintenance

Once a control system is installed and operational, the issue of life-cycle maintenance rears its head. As described earlier, a life-cycle failure essentially means that some part of the control system breaks. This is inevitable; at some point during the system's lifetime a module will burn out, or someone will accidentally cut a network cable, or a power failure will occur, or lightning will strike, or the controller will crash (yes, even embedded Linux)--the list is endless. I dislike using the word ``will'' so much, but failures are not a matter of if but when. Good system design practices lower the probability that the most typical failures occur, and good system designers try their best to design the system, choose the hardware and implement the design so the anticipated failure occurs infrequently or is due to unusual circumstances.

Figure 2. Enemies of Life-Cycle Maintenance

Redundancy

We've determined that no matter how well a control system is designed, a system failure will occur at some point. Redundancy, however, gives us a way to design around this. Redundancy means duplicating features with backups so that a backup unit takes over another unit's work when it fails. Sensors, I/O modules, I/O units, network cables, infrastructure and even controllers all can be duplicated.

Redundancy does not eliminate system failure, but it allows the control system to tolerate a failure and continue to operate--thus the term ``fault tolerance''. However, someone still must repair the damage to prevent an inevitable system failure. I state this because so many times in my career, a customer is told that a system is redundant and fault-tolerant, and then wonders why I give them a maintenance schedule. Or worse, they wonder why the redundant and fault-tolerant system completely failed after being ignored for three years.

Revisiting Distributed Control with Redundancy

In the second article in the series, I introduced the concept of distributed control where multiple control systems interact. With redundancy, a duplicate and redundant control system can monitor a primary control system but also can take over in the event of a system failure.

Backup systems can grow to be quite complicated rapidly, but this is how a simple backup system works. I have two redundant systems identical to each other. The main system is called the primary and the second system is the backup. There is a dedicated network link that connects the primary and backup controllers. Recall that a controller is typically the Linux computer (or computers) that runs the software-control algorithm. The primary controller ``holds'' the physical system and sends status updates to the backup controller. If there's an anomaly the primary system can detect, the primary controller sends an alert to the backup. Otherwise, the primary controller continues to send updates to the backup system. The backup control system watches the status updates being sent by the primary control system. Here are a few scenarios where the backup controller would come into play.

Scenario 1: Primary Controller Detects an Anomaly

The primary controller determines that a failure has occurred (network cable, infrastructure, power supply, I/O, sensor, etc.) and sends an alarm message to the backup. The primary controller also logs the failure to a report file. The backup receives the messages and brings its I/O system up. At the same time, it forces the primary system off-line to prevent the two controllers from competing on the system. A warning alarm sounds to alert personnel of the failure. The backup system holds the system until it is manually directed to release. Upon this release, the primary controller resumes control and the backup returns to an idle state.

Scenario 2: Primary Controller Fails

When the primary controller fails, it stops communication updates from being sent to the backup controller. The backup controller senses this by having a timeout occur while waiting for a status update from the primary. The backup controller takes over the system, sounds an alarm and logs its transfer of control so maintenance personnel can determine what caused the transfer.

I'd like to point out that if the link from the primary to secondary network fails, the power to the primary controller fails, or a component of the primary fails, a similar failure scenario results.

______________________

Webinar
One Click, Universal Protection: Implementing Centralized Security Policies on Linux Systems

As Linux continues to play an ever increasing role in corporate data centers and institutions, ensuring the integrity and protection of these systems must be a priority. With 60% of the world's websites and an increasing share of organization's mission-critical workloads running on Linux, failing to stop malware and other advanced threats on Linux can increasingly impact an organization's reputation and bottom line.

Learn More

Sponsored by Bit9

Webinar
Linux Backup and Recovery Webinar

Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.

In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.

Learn More

Sponsored by Storix