More on Contingency Plans
A couple of weeks ago, I tangentially mentioned the need for contingency plans. Today, I want to look at them a little more closely. My current job is about as far away from Continuity of Operations (COOP) and disaster recovery (DR) as you can get, yet I still deal with disasters, and with preventing them, both professionally and personally. As some of you know, I am an amateur radio operator, specializing in emergency communications. As such, I spend a lot of personal time sitting in meetings with other department heads, discussing the very real issues of continuity of government, risk assessment, mitigation and recovery in ways that many in Information Technology will never have to consider. Most of the topics revolve around reducing life-threatening risk (you will note I said reduce, not eliminate).
In Information Technology, in most cases, no one will die if a system fails (yes, yes, those of you who work in health care have a completely different level of requirements), but that does not mean our plans can ignore the risk in everything we do, more so if we are managing systems responsible for running other systems. Last Tuesday, this was hammered home to me in a very personal manner. I commute to my job by train. The train shares the tracks with freight trains and Amtrak, which means that a breakdown in any one of the intricate mesh of systems that is the rail system in the United States can cause havoc. Last Tuesday, as anyone who rides the rails knows, was one of those fateful days.
The CSX railroad, which controls rails from Boston to Florida, lost the use of its dispatch system. This is a teletype system that every engineer relies on to get their "traffic orders," which list equipment function level, signal outages, rail conditions, etc. (as it was explained to us lay people). Even on a short run like the train I take, the orders can run up to 10 pages, and EACH train must have them. Of course, the first question is why this is on paper, but ignoring that for the moment, the loss of this system's capabilities is catastrophic. It essentially froze many trains in place. In my corner of the world, that meant several thousand people were affected. Most of these people ended up having to take to the roads, which in the Washington, DC area only means more chaos and traffic in a place where chaos and traffic are already routine. When these sorts of delays occur, they cost money. The train company has to pay for people to take the Subway; people have to pay to drive and park in addition to what they have paid for their train ticket. And I am only aware of some of the local costs. I would hazard a guess that this outage was very expensive, not just for my local train system but for CSX.
The system was up and operational in a fairly short period of time. Did they implement a disaster recovery plan? Roll back a patch? I do not know, but they got the system back up and running.
When we as systems people design systems for disaster recovery or continuity of operations, we are usually looking at downtime. How long will the system be down? We really should be looking at cost and effect. Sadly, many companies cannot tell you the cost of downtime. This is especially true for service companies. I expect that CSX could tell me to the penny how much each minute of downtime cost them, but does that include the costs borne by the secondary and tertiary systems impacted? Was the cost high enough to justify engineering the system so that the outage could not happen? This, of course, is the other side of the coin. Most of us, given the time and the equipment, could build a system that was as close to bullet-proof as humanly possible, but the costs would probably be so high as to not be within the realm of reality. So the trade-offs are made. We have all been there and argued for the extra cluster, only to have the accountants tell us that for that 1% chance, it is not worth spending the extra half-million dollars, despite a return on investment that would prevent the company from losing two million dollars.
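The accountants' side of that argument is usually an expected-loss calculation. The following is only a rough sketch of that arithmetic, using the figures from the paragraph above. Treating the 1% as a per-year probability is my assumption (the text does not give a time frame), and the function names are made up for illustration.

```python
# Rough expected-loss arithmetic for the "extra cluster" argument.
# Assumption: the 1% chance is the probability of the outage in any given year.

def expected_loss(probability, loss):
    """Average loss per period: chance of the outage times what it would cost."""
    return probability * loss

def break_even_years(mitigation_cost, probability_per_year, loss):
    """Years of avoided expected loss needed to pay back the mitigation."""
    return mitigation_cost / expected_loss(probability_per_year, loss)

outage_probability = 0.01    # the "1% chance"
outage_loss = 2_000_000      # what the company would lose in the outage
cluster_cost = 500_000       # the "extra half-million dollars"

annual_exposure = expected_loss(outage_probability, outage_loss)
years = break_even_years(cluster_cost, outage_probability, outage_loss)

print(f"Expected loss per year: ${annual_exposure:,.0f}")
print(f"Years for the cluster to pay for itself: {years:.0f}")
```

Under these assumptions the numbers favor the accountants, which is why the argument is so hard to win. What this simple view leaves out is exactly the point raised above: the loss figure counts only the company's own costs, not those borne by the secondary and tertiary systems that depend on it.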
So we build contingency plans, write disaster recovery procedures (that are not exercised enough, and are not known by enough people) and cross our fingers that tomorrow we will not find our company on the front page of the Wall Street Journal.
David Lane, KG4GIY, is a member of Linux Journal's Editorial Advisory Panel and the Control Op for Linux Journal's Virtual Ham Shack.