More on Contingency Plans
A couple of weeks ago, I tangentially mentioned the need for contingency plans. Today, I want to look at them a little more closely. My current job is about as far away from Continuity of Operations (COOP) and disaster recovery (DR) as you can get, yet I still deal with disasters, and with preventing them, both professionally and personally. As some of you know, I am an amateur radio operator, specializing in emergency communications. As such, I spend a lot of personal time sitting in meetings with other department heads, discussing the very real issues of continuity of government, risk assessment, mitigation and recovery in ways that many in Information Technology will never have to consider. Most of the topics revolve around reducing life-threatening risk (you will note I said reduce, not eliminate).
In Information Technology, in most cases, no one will die if there is a system failure (yes, yes, those of you who work in health care have a completely different level of requirements). That does not mean we can skip making sure our plans take into account that there is risk in everything we do, more so if we are managing systems responsible for running other systems. Last Tuesday, this was hammered home to me in a very personal manner. I commute to my job by train. The train shares the tracks with freight trains and Amtrak, which means that a breakdown in any one of the intricate mesh of systems that is the rail system in the United States can cause havoc. Last Tuesday, as anyone who rides the rails knows, was one of those fateful days.
The CSX railroad, which controls rails from Boston to Florida, lost the use of its dispatch system. This is a teletype system that every engineer relies on to get their "traffic orders," which list equipment function level, signal outages, rail conditions, etc. (as it was explained to us lay people). Even on a short run like the train I take, the orders can run up to 10 pages that EACH train must have. Of course, the first question is why this is on paper, but ignoring that for the moment, the loss of this system's capabilities is catastrophic. It essentially froze many trains in place. In my corner of the world, that meant several thousand people were affected. Most of them ended up having to take to the roads, which in the Washington, DC area only adds chaos and traffic to an area where chaos and traffic are routine. When these sorts of delays occur, they cost money. The train company has to pay for people to take the subway, and people have to pay to drive and park in addition to what they have paid for their train ticket. And I am only aware of some of the local costs. I would hazard a guess that this outage was very expensive, not just for my local train system but for CSX.
The system was up and operational in a fairly short period of time. Did they implement a disaster recovery plan? Roll back a patch? I do not know, but they got the system back up and running.
When we as systems people design systems for disaster recovery or continuity of operations, we are usually looking at downtime. How long will the system be down? We really should be looking at cost and effect. Sadly, many companies cannot tell you the cost of downtime. This is especially true for service companies. I expect that CSX could tell me to the penny how much each minute of downtime cost them, but does that include the costs borne by the secondary and tertiary systems impacted? Was the cost high enough to justify engineering the system to prevent the outage? This, of course, is the other side of the coin. Most of us, given the time and the equipment, could build a system that was as close to bullet-proof as humanly possible, but the costs would probably be so high as to not be within the realm of reality. So the trade-offs are made. We have all been there and argued for the extra cluster, only to have the accountants tell us that for that 1% chance, it is not worth spending the extra half-million dollars, despite a return on investment that would prevent the company from losing two million dollars.
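The accountants' side of that argument can be put into rough numbers. What follows is only a minimal sketch, not anyone's actual risk model: it assumes the 1% is the probability of one outage over the planning period, that the two million dollars is the entire loss, and that the extra cluster removes the risk completely. The function and variable names are made up for illustration.

```python
def expected_loss(probability: float, loss: float) -> float:
    """Probability-weighted cost of an outage over the planning period."""
    return probability * loss


def break_even_probability(mitigation_cost: float, loss: float) -> float:
    """Outage probability at which the mitigation pays for itself."""
    return mitigation_cost / loss


# Figures from the example above; treating them as a single-period,
# all-or-nothing risk is an assumption made for this sketch.
outage_probability = 0.01   # the "1% chance"
outage_loss = 2_000_000     # dollars lost if the outage happens
cluster_cost = 500_000      # dollars for the extra cluster

if __name__ == "__main__":
    print("Expected loss:", expected_loss(outage_probability, outage_loss))
    print("Break-even probability:",
          break_even_probability(cluster_cost, outage_loss))
```

On those assumptions the expected loss is about $20,000 against a $500,000 cluster, and the outage would need roughly a 25% chance of happening before the spend broke even. That is the calculation the accountants are making. What it leaves out is exactly what is hard to price: the costs borne by secondary and tertiary systems, which would push the loss figure well past two million if anyone could count them.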
So we build contingency plans, write disaster recovery procedures (that are not exercised enough, and are not known by enough people) and cross our fingers that tomorrow we will not find our company on the front page of the Wall Street Journal.