Automating the Physical World with Linux, Part 3: Designing around System Failure

Bryce examines some of the causes of system failure and gives some tips on how to avoid it.

This is the last in a series of articles introducing the field of control automation and its use with Linux. In this final article, I'll introduce the concept of system failure and present some ways to design around it. The importance of preparing an automation system to deal with the unexpected cannot be overstated; high-throughput hardware and embedded Linux kernels let us build powerful automation systems, but failing to plan for the day these systems fail could lead to catastrophe.

The first two articles in the series established the ease with which Linux can be used to implement control automation systems. In the first article (see the May/June 2001 issue of ELJ), we saw a simple, Linux-based sprinkler control system and a temperature control system. Both systems used control algorithms based upon how I actually performed these tasks manually. The first article also introduced the I/O unit: the hardware that interfaces an embedded controller with the external world so the controller can acquire data from and send commands to a device in the physical world.

The second article (see the July/August 2001 issue of ELJ) discussed how to integrate control functions through coordination. Individual control tasks can be organized to solve a larger problem or to provide an orchestrated action. A hypothetical, lavish resort was introduced to demonstrate coordinated actions among lighting, access and other control systems. For example, the lawn area for special events has sprinkler controls that not only irrigate the grass automatically, but also coordinate with lighting and access-control systems to prevent resort guests from getting wet.

System Failure

Fundamentally, a control system automates a physical task, such as watering a lawn, so it can occur without the need for human intervention. This automation reduces or eliminates the factor of human error and generally means that the task is performed regularly and reliably. However, it also means that a human operator usually isn't present to respond to problems that may occur. Similarly, networking multiple control systems together allows highly complex actions to be performed with the same regularity and precision. Such a highly integrated and coordinated system, however, further distances the human operator from the tasks being performed.

With control systems designed to reduce the need for a human operator, chances increase that a system failure may occur unnoticed and result in a problem. Depending on the control system's application (that is, the tasks that are being controlled), system failures may be catastrophic, causing financial loss, property damage or personal injury. System failure is a reality that every control-system designer must consider.

For a particular application, the likelihood that a system failure may occur (and the potential results if it does) justifies the amount of effort put into designing fault tolerance into a control system. For example, a sprinkler system that stays on for two days will lead to a higher water bill but may not result in property damage or personal injury (except for the loss of some plants from overwatering).
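One rough way to reason about this trade-off is to multiply how often a failure is expected by what it costs, then compare the result with the yearly cost of guarding against it. The sketch below is only an illustration of that arithmetic: the function names and every figure in the usage note are hypothetical, and a purely monetary model ignores consequences such as personal injury that should never be reduced to a price.

```python
def expected_annual_loss(failures_per_year: float, cost_per_failure: float) -> float:
    """Expected yearly cost of one failure mode: likelihood times consequence."""
    return failures_per_year * cost_per_failure


def mitigation_is_justified(
    failures_per_year: float,
    cost_per_failure: float,
    mitigation_cost_per_year: float,
) -> bool:
    """Return True when the expected loss exceeds the cost of preventing it.

    This is a deliberately simple rule of thumb (an assumption of this sketch),
    not a substitute for a proper risk assessment.
    """
    loss = expected_annual_loss(failures_per_year, cost_per_failure)
    return loss > mitigation_cost_per_year
```

With made-up numbers, a sprinkler valve that sticks open once every two years (0.5 failures per year) at a cost of 200 in wasted water gives an expected loss of 100 per year, so a safeguard costing 30 per year would be worth adding, while one costing 150 per year would not.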

Detection and recovery are viable options in addressing system failures. Additional hardware can be added to oversee a system; usually the additional hardware costs are insignificant compared to the cost of a single failure that goes unnoticed. Hardware that is added to oversee the system may provide not just some kind of fail-safe recovery but also alert personnel.
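The oversight idea can be sketched in software as a heartbeat monitor: the controller reports in periodically, and if it goes quiet for too long the monitor puts the system into a safe state and notifies someone. This is a minimal sketch under stated assumptions, not a description of any particular product. The class name `HeartbeatMonitor` and the `fail_safe` and `alert` callbacks are hypothetical, and in a real installation the monitor would normally run on separate hardware so that it survives a failure of the controller it watches.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Detects a silent controller and triggers recovery plus an alert."""

    def __init__(
        self,
        timeout_s: float,
        fail_safe: Callable[[], None],
        alert: Callable[[str], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_s = timeout_s
        self.fail_safe = fail_safe  # hypothetical hook, e.g. close a valve
        self.alert = alert          # hypothetical hook, e.g. page an operator
        self.clock = clock          # injectable so the logic can be tested
        self.last_beat = clock()
        self.tripped = False

    def beat(self) -> None:
        """Called by the controller each time it completes a healthy cycle."""
        self.last_beat = self.clock()
        self.tripped = False

    def check(self) -> bool:
        """Called periodically by the overseer; returns True while tripped."""
        silent_for = self.clock() - self.last_beat
        if not self.tripped and silent_for > self.timeout_s:
            self.tripped = True
            self.fail_safe()  # recovery: force a known safe state
            self.alert(f"no heartbeat for {silent_for:.1f} s")  # detection
        return self.tripped
```

The monitor runs the fail-safe action and the alert only once per outage, so repeated calls to `check()` do not flood personnel with messages; a later `beat()` from a recovered controller re-arms it.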

System Failure Categories

There are two general categories of system failure: failures related to the design of the control system itself and life-cycle failures of the physical system. (Since we're focusing on autonomous control systems, I'm excluding the category of system failure due to operator error.) Failures related to the design of the control system may be due to software design flaws, improperly installed or calibrated devices, or control algorithms that are incorrect or inadequate for the tasks being controlled. Simulation and validation are the solutions for detecting these problems.

A life-cycle failure of the physical system essentially means that some part of the control system has broken. Obviously this covers a wide range of control-system elements: power supply, embedded controller(s), infrastructure, cables, sensors, actuators and other components. Maintenance is typically the solution for avoiding life-cycle failures.

Figure 1. Categories of System Failures