More on Contingency Plans

July 28th, 2008 by David Lane

Your rating: None Average: 4 (3 votes)

A couple of weeks ago, I tangentially mentioned the need for contingency plans. Today, I want to look at them a little more closely. My current job is about as far away from Continuity of Operations (COOP) and disaster recovery (DR) as you can get, yet I still deal in the issues of disasters, and preventing them, both professionally and personally. As some of you know, I am an amateur radio operator, specializing in emergency communications. As such I spend a lot of personal time sitting in meetings with other department heads, discussing the very real issues of continuity of government, risk assessment and mitigation and recovery in ways that many in Information Technology will never have to consider. Most of the topics revolve reducing life threatening risk (you will note I said reduce, not eliminate).

In Information Technology, in most cases, if there is a system failure, someone will not die (yes, yes, those of you who work in health care have a completely different level of requirements) but that does not mean that we should not make sure that our plans do not take into account that there is risk in everything we do, more so if we are managing systems responsible for running other systems. Last Tuesday, this was hammered home to me in a very personal manner. I commute to my job by train. The train shares the tracks with freights and Amtrak, which means that a breakdown in any one of the intricate mesh of systems that is the rail system in the United States can cause havoc. Last Tuesday, as anyone who rides the rails knows, was one of those fateful days.

The CSX railroad, which controls rails from Boston to Florida, lost the use of their dispatch system. This is a teletype system that every engineer relies on to get their "traffic orders" that list equipment function level, signal outages, rail conditions, etc (as it was explained to us lay people) and even on a short run like the train I take, it can run up to 10 pages that EACH train must have. Of course, the first question is why is this on paper, but ignoring that for the moment, the loss of this system's capabilities is catastrophic. It essentially froze many trains in place. In my corner of the world, that means the impact was several thousand people. Most of these people ended up having to take to the roads, which in the Washington, DC area only means chaos and traffic in an area where chaos and traffic are routine. When these sorts of delays occur, it costs money. The train company has to pay for people to take the Subway, people have to pay to drive and park in addition to what they have paid for their train ticket. And I am only aware of some of the local costs. I would hazard to guess that this outage was very expensive, not just for my local train system but for CSX.

The system was up and operation in a fairly short period of time. Did they implement a disaster recovery plan? Roll back a patch? I do not know, but they got the system back up and running.

When we as systems people design systems for disaster recover or continuity of operations, we are usually looking at downtime. How long will the system be down? We really should be looking at cost and effect. Sadly, many companies cannot tell you the cost of downtime. This is especially true for service companies. I expect that CSX could tell me to the penny how much each minute of downtime cost them, but does that include the costs borne by secondary and tertiary systems impacted? Was the cost high enough that the outage could have been engineered to prevent it? This, of course, is the other side of the coin. Most of us, given the time and the equipment could build a system that was as close to bullet-proof as humanly possible, but the costs would probably be so high as to not be within the realm of reality. So the trade-offs are made. We have all been there and argued for the extra cluster, only to have the accountants tell us that for that 1% chance, it is not worth spending the extra half-million dollars, despite a return on investment that would prevent the company from losing two million dollars.

So we build contingency plans, write disaster recovery procedures (that are not exercised enough, and are not known by enough people) and cross our fingers that tomorrow we will not find our company on the front page of the Wall Street Journal.
__________________________
David Lane is a member of Linux Journal's Reader Advisory Board.


Special Magazine Offer -- Free Gift with Subscription
Receive a free digital copy of Linux Journal's System Administration Special Edition as well as instant online access to current and past issues. CLICK HERE for offer

Linux Journal: delivering readers the advice and inspiration they need to get the most out of their Linux systems since 1994.

Post new comment

Please note that comments may not appear immediately, so there is no need to repost your comment.
The content of this field is kept private and will not be shown publicly.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <pre> <ul> <ol> <li> <dl> <dt> <dd> <i> <b>
  • Lines and paragraphs break automatically.

More information about formatting options

Newsletter

Each week Linux Journal editors will tell you what's hot in the world of Linux. You will receive late breaking news, technical tips and tricks, and links to in-depth stories featured on www.linuxjournal.com.
Sign up for our Email Newsletter

Tech Tip Videos

From the Magazine

July 2009, #183

News Flash: Linux Kernel 3.0 to include an on-the-go Expresso machine interface! Ok, maybe not, but Linux is definitely going mobile, from phones to e-readers. Find out more inside about Android, the Kindle 2, the Western Digital MyBook II, The Bug, and Indamixx (a portable recording studio). And if you've gone mobile and you been wanting more Emacs in your life then check out Conkeror.


To compliment the mobile we've got the stationary: parsing command line options with getopt, checking your Ruby code with metric_fu, and building a secure Squid proxy. How is this stationary you ask? What can we say? It's not. We just wanted to see if anybody actually read this part of the page :) .


All this and more, and all you have to do is get your hot sweaty hands on the latest copy of Linux Journal.





Read this issue