It Is Time to Rethink Disaster Recovery
On April 19, 1995, Timothy McVeigh and others destroyed the Murrah federal building in Oklahoma City, Oklahoma. It was, to date, the worst case of terrorism in the United States since the Civil War. On that day a number of things changed, but the biggest lessons were not well learned.
On September 11, 2001, Al-Qaeda crashed aircraft into the financial district of New York City and the Pentagon in Washington, DC. It was the worst case of terrorism to date since Oklahoma City. A number of things changed, but some of the largest lessons were still not well learned.
Since early 2003, a strain of the influenza virus (H5N1) has been making its way around the world. Call it bird (avian) flu, call it swine flu (which is really a different strain, H1N1, and more virulent to humans), but in the late summer of 2009, it is still with us, and according to the World Health Organization we are officially in the grips of a pandemic. Perhaps now is the time to review the lessons of both Oklahoma City and September 11, because a full-blown influenza outbreak could be a more telling test than either of just how prepared our systems really are.
Following the bombing of the Murrah building, a long investigation ensued. During this time a number of companies went out of business. Following the collapse of the Twin Towers, a long investigation ensued and a number of companies went out of business. In the event of a pandemic, it is safe to say a number of businesses will fail as well. Why? Not because of loss of life, but because the people working for these companies could not reach their facilities, and the IT infrastructure, to conduct the routine business they were employed to do. That holds for everything from CPAs to fulfillment houses. The damage done in 1995 closed several city blocks in Oklahoma City, and a number of companies, mostly small, lost access to their facilities during this time because the area was a crime scene. Similarly, large parts of downtown New York were cordoned off for safety reasons, as well as because they were a crime scene. A pandemic flu could have similar implications without a single piece of yellow tape.
As an IT architect, it is my job to build a robust, redundant system. But, like most architects, I based my assumptions on people being available to go to the disaster recovery site, to take the tapes to XYZ recovery company, or to make sure my disaster site is x number of miles away from my primary site. These are some of the lessons, and the financial industry learned them and executed them quite successfully in the days following September 11, 2001. But in every picture and description I saw, makeshift tables were layered with machines and wires, clearly set up on the fly by IT professionals, in many cases after working long hours to get the job done. Disaster recovery of that scale worked. But what happens when the disaster is not a loss of systems, but a loss of access to the systems and a loss of the manpower to run them?
In the event of a pandemic, the experts have made the following predictions. First, absenteeism could be as high as 40%. For an IT staff of 10, that is 4 people out, either sick themselves or caring for someone who is. Second, depending on severity, mandatory separation may be instituted. The standard is six feet. Think about how far your desk is from your co-worker's right now. Think about how you get to work, and how you would get to work if you could not sit within six feet of someone. It puts a whole new spin on the issues of mass transit. Finally, depending on the management of your company, rotation schedules might be implemented where half the staff is at home while half the staff is in the office. What sort of impact would that have on your IT services and your ability to manage your IT infrastructure? And are you ready for the level of remote access requests that will come flooding into the department, and for the issues of fulfilling these requests?
As I have said a number of times, those of us who work in IT just cannot win. When things are humming along smoothly, the bean counters are wondering why they are paying us, and when things are crashing down around us, the bean counters are wondering why they are paying us. In tight times, IT is almost always the first department to suffer cuts. Usually, those cut are at the top and the bottom of the stack, leaving those in the middle to bear the load, often without being properly briefed on the various back doors, trap doors and the ever-popular "what does that box do?" In the late 1990s and early 2000s, a number of companies, in cost-cutting moves, dumped real estate and went to remote access. Over the last five-odd years, those telecommuting trends have reversed as management and employees want to be seen as valuable (and thus remain employed), and the communications lines have been slashed as a useless expense, without the forethought of disaster preparedness. As IT people, we are beholden to the budgets. Most of us work for companies that are more concerned with the quarterly stock price and how it can be boosted for the next quarter, with very little long-term strategic planning being done. But it does not have to be this way.
It is our responsibility to exercise the disaster recovery plans, so we have the opportunity to apply new tactics to the disaster recovery scenarios. Suggest that the next DR test include a 40% staff cut. Roll dice, generate the names randomly, whatever works for you, and tell those people just not to show up. Can you put the systems back online? What happens if you cannot get the tapes to the DR site? What happens if your remote access systems do not work? These are only some of the things we should be thinking about and preparing for.
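One way to "generate the names randomly" is a few lines of Python. This is only a minimal sketch, not a prescribed procedure: the function name `pick_absentees`, the `absent_percent` parameter and the sample roster are all hypothetical, and the 40% default simply mirrors the absenteeism figure quoted above.

```python
import random


def pick_absentees(staff, absent_percent=40, seed=None):
    """Return a random subset of staff to sit out the DR test.

    staff          -- list of names (hypothetical roster)
    absent_percent -- whole-number percentage of staff to remove
    seed           -- optional seed so a drawing can be reproduced
    """
    roster = list(staff)
    # Round up with integer arithmetic so that a small team
    # still loses at least one person when the percentage is nonzero.
    count = (len(roster) * absent_percent + 99) // 100
    rng = random.Random(seed)
    return sorted(rng.sample(roster, count))


if __name__ == "__main__":
    # Illustrative names only.
    team = ["Ana", "Ben", "Cho", "Dev", "Eli",
            "Fay", "Gus", "Hal", "Ivy", "Jon"]
    for name in pick_absentees(team):
        print(f"{name}: do not show up for the DR test")
```

Leaving `seed` unset gives a fresh drawing each run; passing a seed lets you repeat the same drawing later if you want to compare results between two tests.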
As IT professionals, we tend to get boresighted on hardware and software, in many cases down in the weeds so deep we do not see how all the parts go together, or what other parts are needed, or, as one former boss used to remind me, "for want of a nail…" So, as we sit in the middle of hurricane season, with tornadoes popping up in unusual places and with increased ferocity, remember that the winter is coming, that there are other concerns out there, and that we should be considering an all-hazards approach in our disaster planning. And sometimes that means nothing happens to the equipment.