Tales from the Server Room - Panic on the Streets of London
I've always thought it's better to learn from someone else's mistakes than from my own. In this column, Kyle Rankin or Bill Childers will tell a story from their years as systems administrators while the other will chime in from time to time. It's a win-win: you get to learn from our experiences, and we get to make snide comments to each other. Kyle tells the first story in this series.
I was pretty excited about my first trip to the London data center. I had been to London before on vacation, but this was the first time I would visit our colocation facility on business. What's more, it was the first remote data-center trip I was to take by myself. Because I still was relatively new to the company and the junior-most sysadmin at the time, this was the perfect opportunity to prove that I knew what I was doing and could be trusted for future trips.
The maintenance was relatively straightforward. A few machines needed a fresh Linux install, plus I would troubleshoot an unresponsive server, audit our serial console connections, and do a few other odds and ends. We estimated it was a two-day job, but just in case, we added an extra provisional day.
[Bill: If I remember right, I had to fight to get that extra day tacked onto the trip for you. We'd learned from past experience that nothing at that place seemed easy at face value.]
Even with an extra day, I wanted this trip to go smoothly, so I came up with a comprehensive plan. I ordered the tasks by priority and wrote out detailed lists of the commands and procedures I would use to accomplish each one. I even set up an itemized checklist of everything I needed to take with me.
[Bill: I remember thinking that you were taking it way too seriously—after all, it was just a kickstart of a few new machines. What could possibly go wrong? In hindsight, I'm glad you made all those lists.]
The first day I arrived at the data center, I knew exactly what I needed to do. Once I got my badge and was escorted through multiple levels of security to our colocation cages, I would kickstart each of the servers on my list one by one and perform all the manual configuration steps they needed. If I had time, I could finish the rest of the maintenance; otherwise, I'd leave any other tasks for the next day.
Now, it's worth noting that at this time we didn't have a sophisticated kickstart system in place nor did we have advanced lights-out management—just a serial console and a remotely controlled power system. Although our data center did have a kickstart server with a package repository, we still had to connect each server to a monitor and keyboard, boot from an install CD and manually type in the URL to the kickstart file.
[Bill: I think this experience is what started us down the path of a lights-out management solution. I remember pitching it to the executives as “administering from the Bahamas”, and relaying this story to them was one of the key reasons that pitch was successful.]
After I had connected everything to the first server, I inserted the CD, booted the system and typed in my kickstart URL according to my detailed plans. Immediately I saw the kernel load, and the kickstart process was under way. Wow, if everything keeps going this way, I might even get this done early, I thought. Before I could start making plans for my extra days in London though, I saw the kickstart red screen of death. The kickstart logs showed that for some reason, it wasn't able to retrieve some of the files it needed from the kickstart server.
Great, now I needed to troubleshoot a broken kickstart server. Luckily, I had brought my laptop with me, and the troubleshooting was straightforward. I connected my laptop to the network, eventually got a DHCP lease, pointed the browser to the kickstart server, and sure enough, I was able to see my kickstart configuration files and browse through my package repository with no problems.
I wasn't exactly sure what was wrong, but I chalked it up to a momentary blip and decided to try again. This time, the kickstart failed, but at a different point in the install. I tried a third time, and it failed at the original point in the install. I repeated the kickstart process multiple times, trying to see some sort of pattern, but all I saw was the kickstart fail at a few different points in the install.
The most maddening thing about this problem was the inconsistency. What's worse, even though I had more days to work on this, the kickstart of this first server was the most important task to get done immediately. In a few hours, I would have a team of people waiting on the server so they could set it up as a database system.
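For readers facing a similarly intermittent failure, repeating the downloads from a laptop is much quicker than repeating the whole install, and it might expose a pattern that a single successful browser visit hides. What follows is only a minimal sketch, not what was done in the story: the host name, file paths, and function names are hypothetical placeholders, and it assumes the kickstart server is reachable over plain HTTP.

```python
import http.client
import time
import urllib.request
from collections import Counter


def fetch_ok(url, timeout=5.0):
    """Return True if the URL downloads completely, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
        return True
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def probe(urls, rounds=20, delay=1.0, fetch=fetch_ok, sleep=time.sleep):
    """Fetch every URL `rounds` times and count the failures per URL."""
    failures = Counter()
    for round_number in range(rounds):
        for url in urls:
            if not fetch(url):
                failures[url] += 1
        if round_number < rounds - 1:
            sleep(delay)
    return failures


def failure_rates(failures, urls, rounds):
    """Convert failure counts into a fraction of failed attempts per URL."""
    return {url: failures[url] / rounds for url in urls}


if __name__ == "__main__":
    # Hypothetical URLs: substitute the files your own install actually requests.
    targets = [
        "http://kickstart.example.com/ks/server.cfg",
        "http://kickstart.example.com/repo/repodata/repomd.xml",
    ]
    counts = probe(targets, rounds=20, delay=1.0)
    for url, rate in failure_rates(counts, targets, 20).items():
        print(f"{rate:6.1%} failed  {url}")
```

If every file fails at roughly the same rate, the network path is a more likely suspect than any one file. If only certain files fail, the repository itself deserves a closer look. A clean run proves little, because the laptop and the server being installed may not share the same cable, switch port, or network driver.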
Kyle Rankin is a systems architect and the author of DevOps Troubleshooting, The Official Ubuntu Server Book, Knoppix Hacks, Knoppix Pocket Reference, Linux Multimedia Hacks, and Ubuntu Hacks.
Comments
When will you publish the next story?
Great story. I'm sure you have many more projects to talk about. I love the technical aspect as well, so don't be afraid to go deeper into technical detail. As an aspiring junior sysadmin, I'd love to learn from others' experience. Thanks for sharing. I can't wait for more.