More on Contingency Plans

A couple of weeks ago, I tangentially mentioned the need for contingency plans. Today, I want to look at them a little more closely. My current job is about as far away from Continuity of Operations (COOP) and disaster recovery (DR) as you can get, yet I still deal in the issues of disasters, and preventing them, both professionally and personally. As some of you know, I am an amateur radio operator, specializing in emergency communications. As such I spend a lot of personal time sitting in meetings with other department heads, discussing the very real issues of continuity of government, risk assessment and mitigation and recovery in ways that many in Information Technology will never have to consider. Most of the topics revolve reducing life threatening risk (you will note I said reduce, not eliminate).

In Information Technology, in most cases, if there is a system failure, someone will not die (yes, yes, those of you who work in health care have a completely different level of requirements) but that does not mean that we should not make sure that our plans do not take into account that there is risk in everything we do, more so if we are managing systems responsible for running other systems. Last Tuesday, this was hammered home to me in a very personal manner. I commute to my job by train. The train shares the tracks with freights and Amtrak, which means that a breakdown in any one of the intricate mesh of systems that is the rail system in the United States can cause havoc. Last Tuesday, as anyone who rides the rails knows, was one of those fateful days.

The CSX railroad, which controls rails from Boston to Florida, lost the use of their dispatch system. This is a teletype system that every engineer relies on to get their "traffic orders" that list equipment function level, signal outages, rail conditions, etc (as it was explained to us lay people) and even on a short run like the train I take, it can run up to 10 pages that EACH train must have. Of course, the first question is why is this on paper, but ignoring that for the moment, the loss of this system's capabilities is catastrophic. It essentially froze many trains in place. In my corner of the world, that means the impact was several thousand people. Most of these people ended up having to take to the roads, which in the Washington, DC area only means chaos and traffic in an area where chaos and traffic are routine. When these sorts of delays occur, it costs money. The train company has to pay for people to take the Subway, people have to pay to drive and park in addition to what they have paid for their train ticket. And I am only aware of some of the local costs. I would hazard to guess that this outage was very expensive, not just for my local train system but for CSX.

The system was up and operation in a fairly short period of time. Did they implement a disaster recovery plan? Roll back a patch? I do not know, but they got the system back up and running.

When we as systems people design systems for disaster recover or continuity of operations, we are usually looking at downtime. How long will the system be down? We really should be looking at cost and effect. Sadly, many companies cannot tell you the cost of downtime. This is especially true for service companies. I expect that CSX could tell me to the penny how much each minute of downtime cost them, but does that include the costs borne by secondary and tertiary systems impacted? Was the cost high enough that the outage could have been engineered to prevent it? This, of course, is the other side of the coin. Most of us, given the time and the equipment could build a system that was as close to bullet-proof as humanly possible, but the costs would probably be so high as to not be within the realm of reality. So the trade-offs are made. We have all been there and argued for the extra cluster, only to have the accountants tell us that for that 1% chance, it is not worth spending the extra half-million dollars, despite a return on investment that would prevent the company from losing two million dollars.

So we build contingency plans, write disaster recovery procedures (that are not exercised enough, and are not known by enough people) and cross our fingers that tomorrow we will not find our company on the front page of the Wall Street Journal.

______________________

David Lane, KG4GIY is a member of Linux Journal's Editorial Advisory Panel and the Control Op for Linux Journal's Virtual Ham Shack

White Paper
Linux Management with Red Hat Satellite: Measuring Business Impact and ROI

Linux has become a key foundation for supporting today's rapidly growing IT environments. Linux is being used to deploy business applications and databases, trading on its reputation as a low-cost operating environment. For many IT organizations, Linux is a mainstay for deploying Web servers and has evolved from handling basic file, print, and utility workloads to running mission-critical applications and databases, physically, virtually, and in the cloud. As Linux grows in importance in terms of value to the business, managing Linux environments to high standards of service quality — availability, security, and performance — becomes an essential requirement for business success.

Learn More

Sponsored by Red Hat

White Paper
Private PaaS for the Agile Enterprise

If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.

Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.

Learn More

Sponsored by ActiveState