Automating the Physical World with Linux, Part 3: Designing around System Failure

by Bryce Nakatani

This is the last in a series of articles introducing the field of control automation and its use with Linux. In this final article, I'll introduce the concept of system failure and present some ways to design around it. The importance of preparing an automation system to deal with the unexpected cannot be overstated; high-throughput hardware and embedded Linux kernels let us build powerful automation systems, but not planning for when these systems fail could lead to catastrophe.

The first two articles in the series established the ease with which Linux can be used to implement control automation systems. In the first article (see the May/June 2001 issue of ELJ), we saw a simple, Linux-based sprinkler control system and a temperature control system. Both systems used control algorithms based upon how I actually performed these tasks manually. The first article also introduced the I/O unit: the hardware that interfaces an embedded controller with the external world so the controller can acquire data from and send commands to a device in the physical world.

The second article (see the July/August 2001 issue of ELJ) discussed how to integrate control functions through coordination. Individual control tasks can be organized to solve a larger problem or to provide an orchestrated action. A hypothetical, lavish resort was introduced to demonstrate coordinated actions among lighting, access and other control systems. For example, the lawn area for special events has sprinkler controls that not only irrigate the grass automatically, but also coordinate with lighting and access-control systems to prevent resort guests from getting wet.

System Failure

Fundamentally, a control system automates a physical task, such as watering a lawn, so it can occur without the need for human intervention. This automation reduces or eliminates the factor of human error and generally means that the task is performed regularly and reliably. However, it also means that a human operator usually isn't present to respond to problems that may occur. Similarly, networking multiple control systems together allows highly complex actions to be performed with the same regularity and precision. Such a highly integrated and coordinated system, however, further distances the human operator from the tasks being performed.

With control systems designed to reduce the need for a human operator, chances increase that a system failure may occur unnoticed and result in a problem. Depending on the control system's application (that is, the tasks that are being controlled), system failures may be catastrophic, causing financial loss, property damage and personal injury. System failure is a statistic that every control-system designer must consider.

For a particular application, the likelihood that a system failure may occur (and the potential results if it doesn't) justifies the amount of effort put into designing fault tolerance into a control system. For example, a sprinkler system that stays on for two days will lead to a higher water bill but may not result in property damage or personal injury (except for the loss of some plants from overwatering).

Detection and recovery are viable options in addressing system failures. Additional hardware can be added to oversee a system; usually the additional hardware costs are insignificant compared to the cost of a single failure that goes unnoticed. Hardware that is added to oversee the system may provide not just some kind of fail-safe recovery but also alert personnel.

System Failure Categories

There are two general categories of system failure: failures related to the design of the control system itself and life-cycle failures of the physical system. (Since we're focusing on autonomous control systems, I'm excluding the category of system failure due to operator error.) Failures related to the design of the control system may be due to software design flaws, improperly installed and calibrated devices, or control algorithms that are incorrect or inadequate for the tasks being controlled. Simulation and validation is the solution for detecting these problems.

A life-cycle failure of the physical system essentially means that some part of the control system has broken. Obviously this covers a wide range of control-system elements: power supply, embedded controller(s), infrastructure, cables, sensors, actuators and other components. Maintenance is typically the solution for avoiding life-cycle failures.

Figure 1. Categories of System Failures

Using Simulation to Avoid Design Flaws

The simple sprinkler control system introduced in the first article would not benefit from a simulator. Since my embedded controller never inspects what the sprinklers are doing (there is no feedback, which is the result of an open-loop system), it will never be able to detect a failure. I can change this to make using a simulator worthwhile. If I add a device that senses when water is flowing in the pipe, the controller can detect whether water is flowing when it shouldn't be (for example, when all the valves are closed).

In the case of my sprinkler system, I could simulate a flow sensor by adding a simple on/off switch. When the computer opens the water valve, I manually turn on the switch. The switch simulates a flow sensor sending a signal to the controller, meaning that water is flowing through the pipe. When the computer closes the valve, I'll leave the switch on to create an anomalous condition: the valves are closed, but the flow sensor still detects water flowing through the pipe. The controller should now take some action to respond to this condition.

Alarms

The best way to describe what the controller should do when a system anomaly occurs is ``yell, scream or blink''. A system anomaly alarm is a way for a control system to indicate to the user that something is wrong. If I connect a buzzer or horn, I program the system to turn on the audible alarm. If there's a display connected to the system, I have it turn red and flash ``FAILURE''. If there's a pager or e-mail system, alert messages are sent to people all over the world. In short, there are numerous ways to notify users that a control system has failed. It's important to ensure that the alarm action is appropriate to the situation. Don't use a low-key blinking light for a sprinkler valve that's stuck open and is flooding the golf course. On the other hand, don't send an electric shock to someone's chair; the system's users may not appreciate it.

Controllers as Simulators

Simulation is a unique science. Simulators are control systems that use mathematical or logical models to reproduce a physical system's functions. They can also test whether a control system reacts and functions properly, using scenario testing that reproduces the signals a control system receives from sensors and other devices.

There's nothing really unusual in implementing simulators. I think of a simulator of a physical system as a control system, but backward. For example, a simulator could use an embedded controller to connect an output to each of a control system's corresponding inputs. The simulation system would then send signals that match how the physical system would react and monitor how the controller tries to correct them.

Useful validation tests may be performed once a simulation system is coupled with the control system. In the case of a reciprocating engine test, for example, the simulation system can test what the control system does if the oil pressure fails or the engine temperature goes too high. This test will validate whether the criteria for engine protection operates properly. Creating complex simulation scenarios may exercise exception-handling algorithms more rigorously than would ever occur in the real control application, but ultimately this is very beneficial.

Simulators for Training and System Improvement

The word simulator probably makes most people think of flight trainers. This may demonstrate the simulator's most important role: training. Like training a pilot to fly jet aircraft, training personnel to operate a complex new control system is expensive, tedious, yet extremely important. I wouldn't sleep well at night if I knew that new employees at the nuclear power plant down the street got hands-on training on the actual reactor. This is an exaggerated example, of course, but control-system training is a serious issue.

Both those who operate and maintain a control system may become part of the simulator-scenario testing. This type of training allows the staff to become comfortable with the system and learn how to react appropriately if a system failure occurs. These tests also offer another way to improve the system's design, refine operational practices such as maintenance schedules, and implement other functional improvements that make the system more useful and also separate an average system from an excellent system.

I really can't emphasize enough the importance of this type of simulation in control-system design. This simulation is the best opportunity for the developers, designers and users/customers to work together to develop a better system. It's also the best time to make mistakes (whether accidental or deliberate) and learn from them. While mistakes on the real system can't be reversed, mistakes on a simulated control system are just like a video game: just press the reset button.

Cost is the single largest obstacle of simulation. Using a simulation system adds a significant amount of labor and material to a project. In fact, creating the simulation system is equivalent to adding another control system. The simulation system, however, allows the control system to be tested and improved without affecting the real system. Dedicating a duplicate control system with a simulation system offers the benefit of performing many new scenario evaluations, concurrent software improvements for the real control system and continual validation.

There is also a long-term financial gain to using a simulation system. In a production facility, such as an automated assembly line, any system downtime is very expensive. Installing software upgrades often requires a system to be completely shut down, and in a relatively untested control software upgrade, there's usually a very high risk that the new software is unstable. The simulation system offers not just the ability to test the new software, but to determine the time needed to upgrade the current control software to the new version. I'm certain that these reduced downtimes, coupled with a higher confidence in software operation, more than pay back the investment in the simulation system.

To me, simulations offer peace of mind by providing the ability to simulate and test any control function that you have doubts about. Testing complex systems is very difficult, and testing a complex system on the ``real'' machine is impossibly cost-prohibitive, time-consuming and always carries the chance that damage to the physical system may result. In simulation, you have the reset button, plus the time to look back and study the phenomena that caused the failure.

Life-Cycle Maintenance

Once a control system is installed and operational, the issue of life-cycle maintenance rears its head. As described earlier, a life-cycle failure essentially means that some part of the control system breaks. This is inevitable; at some point during the system's lifetime a module will burn out, or someone will accidentally cut a network cable, or a power failure will occur, or lightning will strike, or the controller will crash (yes, even embedded Linux)--the list is endless. I dislike using the word ``will'' so much, but failures are not a matter of if but when. Good system design practices lower the probability that the most typical failures occur, and good system designers try their best to design the system, choose the hardware and implement the design so the anticipated failure occurs infrequently or is due to unusual circumstances.

Figure 2. Enemies of Life-Cycle Maintenance

Redundancy

We've determined that no matter how well a control system is designed, a system failure will occur at some point. Redundancy, however, gives us a way to design around this. Redundancy means duplicating features with backups so that a backup unit takes over another unit's work when it fails. Sensors, I/O modules, I/O units, network cables, infrastructure and even controllers all can be duplicated.

Redundancy does not eliminate system failure, but it allows the control system to tolerate a failure and continue to operate--thus the term ``fault tolerance''. However, someone still must repair the damage to prevent an inevitable system failure. I state this because so many times in my career, a customer is told that a system is redundant and fault-tolerant, and then wonders why I give them a maintenance schedule. Or worse, they wonder why the redundant and fault-tolerant system completely failed after being ignored for three years.

Revisiting Distributed Control with Redundancy

In the second article in the series, I introduced the concept of distributed control where multiple control systems interact. With redundancy, a duplicate and redundant control system can monitor a primary control system but also can take over in the event of a system failure.

Backup systems can grow to be quite complicated rapidly, but this is how a simple backup system works. I have two redundant systems identical to each other. The main system is called the primary and the second system is the backup. There is a dedicated network link that connects the primary and backup controllers. Recall that a controller is typically the Linux computer (or computers) that runs the software-control algorithm. The primary controller ``holds'' the physical system and sends status updates to the backup controller. If there's an anomaly the primary system can detect, the primary controller sends an alert to the backup. Otherwise, the primary controller continues to send updates to the backup system. The backup control system watches the status updates being sent by the primary control system. Here are a few scenarios where the backup controller would come into play.

Scenario 1: Primary Controller Detects an Anomaly

The primary controller determines that a failure has occurred (network cable, infrastructure, power supply, I/O, sensor, etc.) and sends an alarm message to the backup. The primary controller also logs the failure to a report file. The backup receives the messages and brings its I/O system up. At the same time, it forces the primary system off-line to prevent the two controllers from competing on the system. A warning alarm sounds to alert personnel of the failure. The backup system holds the system until it is manually directed to release. Upon this release, the primary controller resumes control and the backup returns to an idle state.

Scenario 2: Primary Controller Fails

When the primary controller fails, it stops communication updates from being sent to the backup controller. The backup controller senses this by having a timeout occur while waiting for a status update from the primary. The backup controller takes over the system, sounds an alarm and logs its transfer of control so maintenance personnel can determine what caused the transfer.

I'd like to point out that if the link from the primary to secondary network fails, the power to the primary controller fails, or a component of the primary fails, a similar failure scenario results.

Scenario 3: Backup Controller Fails While Primary Controller Holds the System

The communication link provides a means via a data protocol to tell the primary controller if the backup controller doesn't respond. When the primary controller detects the backup has failed, it may sound an alarm to inform maintenance personnel that there is something wrong with the backup system. While this isn't a severe warning, the status could become critical if the primary system fails.

Scenario 4: Backup Controller Fails While Holding the System

The backup controller fails while maintaining the system after a primary controller failure. At this point, the physical system no longer has any control since both primary and backup controllers are off-line. Yes, this is really bad news, to say the least. Hopefully the maintenance crew will prevent this scenario from occurring.

At this point I've only discussed controller failures. In this last scenario, if both controllers fail, neither can activate an alarm. The I/O unit, however, may be able to perform some tasks independent of the controllers. For example, some I/O units can detect a communications timeout, which is an event triggered if there is a delay of communication in a specific time period. If both controllers fail, they will stop scanning the I/O. Upon detecting the timeout, the I/O unit can perform a simple action. In this case, it will perform a hard shutdown of the physical system. At the same time, it will activate a very loud alarm and a very bright blinking red light!

Detecting Failure

How does a control system detect a failure? The scenarios we've looked at assume that a system failure can be detected, which is fortunately the case for most failures. Many failure detection methods are very simple. I'll expand on some common methods I've used and introduce a few others as well.

Communication watchdogs: one way a control system can detect when another system fails is to test the rate at which that system sends messages to it. If a control system that sends data to a redundant system goes silent (ceases communications), a general assumption can be made that the system has failed. This failure could be in the control system itself, or it might be in the communications link between primary and redundant controllers (for example, a network cable is cut). The system that detects the silence would typically perform a ``communication watchdog event'', which may be anything from triggering an alarm to turning off the control system.

Redundant sensors: recall from the first article that a control system's I/O unit receives signals from sensors, for example a temperature probe or door contact. Detecting sensor failure can sometimes be a bit difficult. For example, if a sensor measuring the temperature of a fish tank reported a value of -100°F (-73°C) or 350°F (176°C), we could deduce that we have frozen fish, steamed bass or a faulty sensor. Of course, these values don't make sense, so we could apply a ``sanity check'' to the reported value to make sure it falls within a range of realistic temperatures. Another method to address sensor failure is to add a second, redundant sensor and compare its value with the primary sensor. When readings from the two sensors don't agree, you know there's something wrong with one of them. To determine which sensor is correct, however, would actually require a third sensor. With the third reading, the control system can effectively ``vote'' for the correct value.

Additional I/O points: adding I/O points to the control system is another way to guard against system failure. For example, an I/O output controlling a light may have two additional sensors attached to its circuit. One sensor can monitor whether voltage is available for the light, and the other sensor can monitor the amount of power the light consumes. This way, the light can be monitored for bulb failure (circuit voltage good, but no power consumption) or a blown circuit breaker (no voltage is available). This system could possibly also detect more unusual conditions, for example if the light is consuming too much power. If the lightbulb fails, the system could report a ``circuit failure'' or ``bulb failure'' alarm. The alarm could even suggest the maintenance locations and parts needed to repair the failure.

Single point failures: single point failures are perhaps the most troublesome kind of system failures. If the water supply in our sprinkler system fails, for example, we can't water. There's really no practical way to provide a backup water supply, so this would be considered a single point failure. Any system design may have a few of these types of situations; despite adequate planning, they are unavoidable. I typically handle them by listing single point failures in a document and describing why they hold such a status. In the case of the failed water supply, for example, no water supply means that plants won't be watered. This particular single point failure may prove catastrophic for the plants over the long term but doesn't represent a physical hazard to operators and other customers.

The existence of potential single point failures is sometimes due to budget considerations. If I had a water reservoir, I could install an alternate water supply. Clearly this is too expensive, so in designing the sprinkler control system, I chose to allow such a potential failure to exist. Bear in mind that every control system has a single point failure. For example, every system needs electricity to operate. Backup generators can cover short outages, but over a longer period this backup generation will eventually fail due to fuel shortage or generator failure. What constitutes a single point failure is ultimately a question of how broadly you look at a control system's operation.

Failure detection comes at a price. All these methods to detect and avoid system failure require extra software, hardware and/or labor. By now it should be clear that designing a control system to tolerate failure can be expensive, and that sometimes cost or practical considerations make it necessary to allow certain single point failures to exist.

Conclusion

I hope this series ``Automating the Physical World with Linux'' has been enlightening to those new to the field of control automation. We've covered some essential concepts: building on simple algorithms such as those for lawn sprinklers, a control system can grow in complexity to control and monitor complicated tasks. Pairing Linux's well-established networking capabilities with such a coupled and distributed system allows coordinated automation functions over a large geographic area (such as our lavish resort). A control-system designer must also consider how vulnerable a system is to failure; system failures need to be identified and detected, and the customer may need to dictate how this is to occur.

Bryce Nakatani (linux@opto22.com) is an engineer at Opto 22, a manufacturer of automation components in Temecula, California. He specializes in real-time controls, software design, analog and digital design, network architecture and instrumentation.

Load Disqus comments