Automating the Physical World with Linux, Part 3: Designing around System Failure

Bryce examines some of the causes of system failure and gives some tips on how to avoid it.
Scenario 3: Backup Controller Fails While Primary Controller Holds the System

The communication link's data protocol gives the primary controller a way to detect when the backup controller stops responding. When the primary controller detects that the backup has failed, it may sound an alarm to inform maintenance personnel that something is wrong with the backup system. While this isn't a severe warning by itself, the status could become critical if the primary system then fails.

Scenario 4: Backup Controller Fails While Holding the System

The backup controller fails while maintaining the system after a primary controller failure. At this point, the physical system no longer has any control since both primary and backup controllers are off-line. Yes, this is really bad news, to say the least. Hopefully the maintenance crew will prevent this scenario from occurring.

At this point I've only discussed controller failures. In this last scenario, if both controllers fail, neither can activate an alarm. The I/O unit, however, may be able to perform some tasks independent of the controllers. For example, some I/O units can detect a communications timeout, an event triggered when no communication arrives within a specified time period. If both controllers fail, they will stop scanning the I/O. Upon detecting the timeout, the I/O unit can perform a simple action. In this case, it will perform a hard shutdown of the physical system. At the same time, it will activate a very loud alarm and a very bright blinking red light!
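The I/O unit's timeout logic can be sketched roughly as follows. This is an illustrative model, not any particular vendor's firmware; the timeout value, the `IOUnitWatchdog` class and its shutdown/alarm actions are assumptions for the sake of the example.

```python
import time

TIMEOUT_SECONDS = 5.0  # assumed: trip if no controller scan for 5 seconds

class IOUnitWatchdog:
    """Hypothetical sketch of an I/O unit's communications timeout."""

    def __init__(self, timeout=TIMEOUT_SECONDS):
        self.timeout = timeout
        self.last_scan = time.monotonic()
        self.tripped = False

    def scan_received(self):
        """Called each time a controller scans this I/O unit."""
        self.last_scan = time.monotonic()
        self.tripped = False

    def check(self):
        """Called periodically by the I/O unit itself. If neither
        controller has scanned the unit within the timeout period,
        assume both have failed and act independently."""
        if not self.tripped and time.monotonic() - self.last_scan > self.timeout:
            self.tripped = True
            self.hard_shutdown()
        return self.tripped

    def hard_shutdown(self):
        # Drive all outputs to a safe state and annunciate the failure.
        print("Outputs forced to safe state")
        print("Alarm horn ON, red beacon FLASHING")
```

As long as either controller keeps scanning the unit, `scan_received()` resets the timer and the watchdog never trips.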

Detecting Failure

How does a control system detect a failure? The scenarios we've looked at assume that a system failure can be detected, which is fortunately the case for most failures. Many failure detection methods are very simple. I'll expand on some common methods I've used and introduce a few others as well.

Communication watchdogs: one way a control system can detect when another system fails is to test the rate at which that system sends messages to it. If a control system that sends data to a redundant system goes silent (ceases communications), a general assumption can be made that the system has failed. This failure could be in the control system itself, or it might be in the communications link between the primary and redundant controllers (for example, a network cable is cut). The system that detects the silence would then trigger a ``communication watchdog event'', which may be anything from raising an alarm to taking over control of the system.
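A minimal sketch of this heartbeat-style watchdog, seen from the backup controller's side: the heartbeat interval, silence threshold and the take-over/alarm behavior are assumptions chosen for illustration, not a standard protocol.

```python
import time

HEARTBEAT_INTERVAL = 1.0   # assumed: primary sends one heartbeat per second
SILENCE_THRESHOLD = 3.0    # assumed: declare failure after 3s of silence

class BackupController:
    """Hypothetical backup controller monitoring the primary's heartbeats."""

    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.holding_system = False

    def heartbeat_received(self):
        """Called whenever a heartbeat message arrives from the primary."""
        self.last_heartbeat = time.monotonic()

    def poll(self):
        """Run periodically. If the primary goes silent, assume it has
        failed (or the link is cut) and take over the system."""
        silence = time.monotonic() - self.last_heartbeat
        if silence > SILENCE_THRESHOLD and not self.holding_system:
            self.holding_system = True
            self.raise_alarm("primary silent; backup holding system")
        return self.holding_system

    def raise_alarm(self, message):
        print("ALARM:", message)
```

Note that this logic cannot distinguish a dead primary from a cut cable; both look like silence, which is exactly the ambiguity described above.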

Redundant sensors: recall from the first article that a control system's I/O unit receives signals from sensors, for example a temperature probe or door contact. Detecting sensor failure can sometimes be a bit difficult. For example, if a sensor measuring the temperature of a fish tank reported a value of -100°F (-73°C) or 350°F (176°C), we could deduce that we have frozen fish, steamed bass or a faulty sensor. Of course, these values don't make sense, so we could apply a ``sanity check'' to the reported value to make sure it falls within a range of realistic temperatures. Another method to address sensor failure is to add a second, redundant sensor and compare its value with the primary sensor. When readings from the two sensors don't agree, you know there's something wrong with one of them. To determine which sensor is correct, however, would actually require a third sensor. With the third reading, the control system can effectively ``vote'' for the correct value.
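The sanity check and the three-sensor vote can both be expressed in a few lines. The temperature limits below are assumptions for a fish tank; the voting function simply takes the median, so one faulty sensor is outvoted by the two that still agree.

```python
# Assumed plausibility limits for a fish-tank temperature, in °F.
SANE_MIN_F = 32.0
SANE_MAX_F = 110.0

def sane(reading_f):
    """Reject physically implausible readings (e.g. -100°F or 350°F)."""
    return SANE_MIN_F <= reading_f <= SANE_MAX_F

def vote(a, b, c):
    """With three redundant sensors, take the median reading: a single
    failed sensor cannot pull the result away from the two good ones."""
    return sorted([a, b, c])[1]
```

For example, if the second sensor fails and reports 350°F while the other two read 78.1°F and 78.4°F, `sane(350.0)` is false and `vote(78.1, 350.0, 78.4)` still returns 78.4.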

Additional I/O points: adding I/O points to the control system is another way to guard against system failure. For example, an I/O output controlling a light may have two additional sensors attached to its circuit. One sensor can monitor whether voltage is available for the light, and the other sensor can monitor the amount of power the light consumes. This way, the light can be monitored for bulb failure (circuit voltage good, but no power consumption) or a blown circuit breaker (no voltage is available). This system could possibly also detect more unusual conditions, for example if the light is consuming too much power. If the lightbulb fails, the system could report a ``circuit failure'' or ``bulb failure'' alarm. The alarm could even suggest the maintenance locations and parts needed to repair the failure.
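The diagnosis the two extra sensors make possible boils down to a small decision table. This is a sketch under assumed thresholds (a nominal 60W bulb, 1W as "no consumption", 150% of nominal as "too much power"); the alarm strings are illustrative.

```python
NOMINAL_WATTS = 60.0  # assumed rating of the bulb on this circuit

def diagnose(commanded_on, volts_present, watts):
    """Classify a light circuit from two extra I/O points: a voltage
    sensor and a power-consumption sensor."""
    if not commanded_on:
        return "off"
    if not volts_present:
        return "circuit failure"       # no voltage: e.g. blown breaker
    if watts < 1.0:
        return "bulb failure"          # voltage good, but no consumption
    if watts > NOMINAL_WATTS * 1.5:
        return "overcurrent warning"   # light consuming too much power
    return "ok"
```

A reported ``bulb failure'' or ``circuit failure'' alarm could then be annotated with the location and replacement parts needed, as described above.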

Single point failures: single point failures are perhaps the most troublesome kind of system failures. If the water supply in our sprinkler system fails, for example, we can't water. There's really no practical way to provide a backup water supply, so this would be considered a single point failure. Any system design may have a few of these types of situations; despite adequate planning, they are unavoidable. I typically handle them by listing single point failures in a document and describing why they hold such a status. In the case of the failed water supply, for example, no water supply means that plants won't be watered. This particular single point failure may prove catastrophic for the plants over the long term but doesn't represent a physical hazard to operators and other customers.

The existence of potential single point failures is sometimes due to budget considerations. I could install an alternate water supply, such as a reservoir, but clearly this is too expensive; so in designing the sprinkler control system, I chose to allow such a potential failure to exist. Bear in mind that every control system has a single point failure. For example, every system needs electricity to operate. Backup generators can cover short outages, but over a longer period this backup generation will eventually fail due to fuel shortage or generator failure. What constitutes a single point failure is ultimately a question of how broadly you look at a control system's operation.

Failure detection comes at a price. All these methods to detect and avoid system failure require extra software, hardware and/or labor. By now it should be clear that designing a control system to tolerate failure can be expensive, and that sometimes cost or practical considerations make it necessary to allow certain single point failures to exist.