Automating the Physical World with Linux, Part 3: Designing around System Failure

Bryce examines some of the causes of system failure and gives some tips on how to avoid it.
Life-Cycle Maintenance

Once a control system is installed and operational, the issue of life-cycle maintenance rears its head. As described earlier, a life-cycle failure essentially means that some part of the control system breaks. This is inevitable: at some point during the system's lifetime a module will burn out, or someone will accidentally cut a network cable, or a power failure will occur, or lightning will strike, or the controller will crash (yes, even embedded Linux). The list is endless. I dislike using the word "will" so much, but failures are a matter not of if but of when. Good system design practices lower the probability of the most typical failures, and good system designers do their best to design the system, choose the hardware and implement the design so that an anticipated failure occurs infrequently or only under unusual circumstances.

Figure 2. Enemies of Life-Cycle Maintenance

Redundancy

We've determined that no matter how well a control system is designed, a failure will occur at some point. Redundancy, however, gives us a way to design around this. Redundancy means duplicating components so that a backup unit takes over when the unit it duplicates fails. Sensors, I/O modules, I/O units, network cables, infrastructure and even controllers all can be duplicated.

Redundancy does not eliminate failure, but it allows the control system to tolerate a failure and continue to operate, hence the term "fault tolerance". However, someone still must repair the damage; otherwise, a second failure eventually will take the whole system down. I state this because so many times in my career, a customer has been told that a system is redundant and fault-tolerant and then wondered why I handed over a maintenance schedule. Or worse, the customer wondered why the redundant, fault-tolerant system failed completely after being ignored for three years.
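The maintenance argument can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes the primary and the backup fail independently, and the function name and the failure probability passed to it are made up for the example.

```python
def outage_probability(p_unit, backup_working):
    """Chance of a total outage over some fixed interval.

    p_unit is the (assumed) probability that one unit fails during the
    interval. Failures are assumed to be independent of each other.
    """
    if backup_working:
        # Both the primary and the backup must fail.
        return p_unit * p_unit
    # The backup already failed and was never repaired: the system is
    # back to depending on a single unit.
    return p_unit


if __name__ == "__main__":
    print(outage_probability(0.01, backup_working=True))
    print(outage_probability(0.01, backup_working=False))
```

Under these assumptions, an unrepaired failure quietly turns a redundant system back into a single point of failure, which is why the maintenance schedule still matters.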

Revisiting Distributed Control with Redundancy

In the second article in the series, I introduced the concept of distributed control, where multiple control systems interact. With redundancy, a duplicate control system monitors the primary control system and can take over in the event of a failure.

Backup systems can quickly grow quite complicated, but here is how a simple one works. I have two identical systems. The main system is called the primary, and the second system is the backup. A dedicated network link connects the primary and backup controllers. Recall that a controller is typically the Linux computer (or computers) that runs the software-control algorithm. The primary controller "holds" the physical system and sends status updates to the backup controller. If the primary detects an anomaly, it sends an alert to the backup. Otherwise, it continues to send status updates. The backup controller watches the status updates sent by the primary. Here are a few scenarios where the backup controller would come into play.
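As a rough illustration of the traffic on the dedicated link, the following sketch encodes a status update or an alert as a small JSON message. The field names (seq, kind, detail, sent_at) and both function names are assumptions made for this example; the article does not specify a wire format.

```python
import json
import time


def encode_update(seq, healthy, detail=""):
    """Build one message from the primary as UTF-8 encoded JSON bytes.

    A healthy primary sends kind "status"; a detected anomaly is sent
    as kind "alarm" with a short description in detail.
    """
    kind = "status" if healthy else "alarm"
    msg = {"seq": seq, "kind": kind, "detail": detail, "sent_at": time.time()}
    return json.dumps(msg).encode("utf-8")


def decode_update(data):
    """Parse bytes received by the backup; raise ValueError if malformed."""
    msg = json.loads(data.decode("utf-8"))
    if not isinstance(msg, dict) or msg.get("kind") not in ("status", "alarm"):
        raise ValueError("unknown or malformed message")
    return msg


if __name__ == "__main__":
    wire = encode_update(1, healthy=False, detail="I/O module not responding")
    print(decode_update(wire)["kind"])
```

The bytes could be carried over whatever transport the dedicated link uses. A sequence number is included so the backup can notice dropped or repeated updates.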

Scenario 1: Primary Controller Detects an Anomaly

The primary controller determines that a failure has occurred (network cable, infrastructure, power supply, I/O, sensor, etc.) and sends an alarm message to the backup. The primary controller also logs the failure to a report file. The backup receives the message and brings its I/O system up. At the same time, it forces the primary system off-line to prevent the two controllers from competing for the physical system. A warning alarm sounds to alert personnel of the failure. The backup holds the system until it is manually directed to release it. Upon this release, the primary controller resumes control, and the backup returns to an idle state.
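The backup's side of Scenario 1 can be sketched as a two-state machine. This is a minimal illustration, not a real controller: the class and method names are hypothetical, and the side effects (bringing up I/O, forcing the primary off-line, sounding the alarm) are only recorded as strings in a list where real hardware calls would go.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    HOLDING = "holding"


class BackupController:
    def __init__(self):
        self.state = State.IDLE
        self.actions = []  # stand-in for real I/O, network and alarm calls

    def on_alarm(self, detail):
        """Handle an alarm message from the primary."""
        if self.state is State.HOLDING:
            return  # already holding the system; nothing more to do
        self.actions.append("bring I/O up")
        self.actions.append("force primary off-line")
        self.actions.append("sound alarm: " + detail)
        self.state = State.HOLDING

    def manual_release(self):
        """An operator directs the backup to hand control back."""
        if self.state is State.HOLDING:
            self.actions.append("release to primary")
            self.state = State.IDLE


if __name__ == "__main__":
    backup = BackupController()
    backup.on_alarm("sensor failure")
    backup.manual_release()
    print(backup.actions)
```

Note that nothing in the sketch returns the backup to idle automatically; only manual_release does, matching the requirement that the backup holds the system until it is manually directed to release it.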

Scenario 2: Primary Controller Fails

When the primary controller fails, its status updates stop reaching the backup controller. The backup controller senses this when a timeout occurs while it waits for the next status update from the primary. The backup controller then takes over the system, sounds an alarm and logs its transfer of control so maintenance personnel can determine what caused the transfer.
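The timeout check in Scenario 2 can be sketched as a small watchdog. The class name, method names and the two-second timeout in the usage example are assumptions for illustration; the clock is passed in so the logic can be exercised without waiting in real time.

```python
import time


class HeartbeatMonitor:
    """Tracks how long it has been since the primary's last status update."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = clock()

    def update_received(self):
        """Call whenever a status update arrives from the primary."""
        self.last_seen = self.clock()

    def primary_timed_out(self):
        """True once no update has arrived for longer than timeout_s."""
        return self.clock() - self.last_seen > self.timeout_s


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=2.0)
    monitor.update_received()
    if monitor.primary_timed_out():
        print("take over, sound alarm, log transfer of control")
    else:
        print("primary is alive")
```

A monotonic clock is used rather than wall-clock time so that an adjustment to the system time cannot trigger, or mask, a takeover.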

I'd like to point out that a similar scenario results if the network link between the primary and backup controllers fails, the power to the primary controller fails or a component of the primary fails. In each case, the backup simply stops receiving status updates.
