Tales from the Server Room - Panic on the Streets of London
Here I was, thousands of miles away from home, breathing in the warm exhaust from a rack full of servers, trying to bring a stubborn server back to life. I wasn't completely without options just yet. I had a hunch the problem was related to DHCP, so I pored through the logs on my DHCP server and confirmed that, yes, I could see leases being granted to the server, and, yes, there were ample spare leases to hand out. I even restarted the DHCP service for good measure.
Finally, I decided to watch the DHCP logs during a kickstart. I would start the kickstart process, see the machine get its lease, either the first time or when I told it to retry, and then watch it fail later on in the install. I had a log full of successful DHCP requests with no explanation of why it didn't work. Then I had my first real clue: during one of the kickstarts, I noticed that the server had actually requested a DHCP lease multiple times.
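For readers who would rather not eyeball the log, here is a minimal sketch of how repeated lease attempts from one machine could be counted. The article does not say which DHCP server was in use, so the log format below is an assumption (lines in the style of ISC dhcpd syslog output), and the MAC address, IP address, interface name and the count_discovers helper are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical sample in the style of ISC dhcpd syslog output.
# The MAC address, IP address and interface name are invented.
SAMPLE_LOG = """\
dhcpd: DHCPDISCOVER from 00:16:3e:aa:bb:cc via eth0
dhcpd: DHCPOFFER on 10.0.0.50 to 00:16:3e:aa:bb:cc via eth0
dhcpd: DHCPREQUEST for 10.0.0.50 from 00:16:3e:aa:bb:cc via eth0
dhcpd: DHCPACK on 10.0.0.50 to 00:16:3e:aa:bb:cc via eth0
dhcpd: DHCPDISCOVER from 00:16:3e:aa:bb:cc via eth0
dhcpd: DHCPOFFER on 10.0.0.50 to 00:16:3e:aa:bb:cc via eth0
dhcpd: DHCPDISCOVER from 00:16:3e:aa:bb:cc via eth0
dhcpd: DHCPOFFER on 10.0.0.50 to 00:16:3e:aa:bb:cc via eth0
"""

# Matches the client MAC address on a DHCPDISCOVER line (assumed format).
DISCOVER = re.compile(
    r"DHCPDISCOVER from ((?:[0-9a-f]{2}:){5}[0-9a-f]{2})", re.IGNORECASE
)


def count_discovers(log_text):
    """Return a Counter mapping each client MAC to its number of DHCPDISCOVERs."""
    return Counter(m.group(1).lower() for m in DISCOVER.finditer(log_text))


if __name__ == "__main__":
    for mac, attempts in count_discovers(SAMPLE_LOG).items():
        print(f"{mac}: {attempts} lease attempts")
```

A machine that shows several lease attempts within one install is the same clue described above: the client keeps starting the DHCP exchange over.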
Even with this clue, I started running out of explanations. The DHCP server seemed to be healthy. After all, my laptop was able to use it just fine, and I had a log file full of successful DHCP requests. Here I turned to the next phase of troubleshooting: the guessing game. I swapped cables, changed which NIC was connected and even changed the switch port. After all of that, I still had the same issue. I had kickstarted the machine so many times now, I had the entire list of arguments memorized. I was running out of options, patience and, most important, time.
[Bill: I remember seeing an e-mail or two about this. I was comfortably ensconced at the corporate HQ in California, and you were working on this while I was asleep. I'm sure I'd have been able to help more if I'd been awake. I'm glad you were on the case though.]
I was now at the next phase of troubleshooting: prayer. Somewhere around this time, I had my big breakthrough. While I was swapping all the cables around, I noticed something interesting on the switch—the LEDs for the port I was using went amber when I first plugged in the cable, and it took quite a bit of time to turn green. I noticed that the same thing happened when I kickstarted my machine and again later on during the install. It looked as though every time the server brought up its network interface, it would cause the switch to reset the port. When I watched this carefully, I saw during one install that the server errored out of the install while the port was still amber and just before it turned green!
What did all of this mean? Although it was true that the DHCP server was functioning correctly, DHCP requests themselves typically have a 30-second timeout before they give an error. It turned out that this switch was hovering right at the 30-second limit to bring a port up. When it took less than 30 seconds, I would get a lease; when it took longer, I wouldn't. Even though I had found the cause of the problem, it didn't do me much good. Because the installer appeared to reset its port at least three times, there was almost no chance I would be lucky enough to get three consecutive sub-30-second port resets. I had to figure out another way, yet I didn't manage the networking gear, and the people who did wouldn't be awake for hours (see sidebar).
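To put a rough number on that luck, here is a small sketch. It rests on an assumption the article does not make: that each port reset is an independent trial with the same chance of finishing before the DHCP timeout. The function name and the sample probabilities are invented for illustration.

```python
def chance_of_clean_install(p_single, port_resets=3):
    """Probability that every port reset finishes before the DHCP timeout.

    Assumes each reset is an independent trial that succeeds with
    probability p_single (an assumption, not something measured).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    return p_single ** port_resets


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.25):
        print(f"p={p:.2f} per reset -> {chance_of_clean_install(p):.3f} for three resets")
```

Under that assumption, even a coin-flip chance per reset leaves only one chance in eight of getting through all three.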
The ultimate cause of the problem was that every time the port was reset, the switch recalculated the spanning tree for the network, which can sometimes take a minute or more. The long-term solution was to make sure that all ports we intended to kickstart were set with the portfast option so that they came up within a few seconds.
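The arithmetic behind that 30-second knife edge can be sketched as follows. The article does not give this switch's timer settings, so the sketch assumes the classic 802.1D default forward delay of 15 seconds, which a port spends once in the listening state and once more in the learning state. The link_up_s parameter and both function names are hypothetical.

```python
# Classic 802.1D default forward delay, in seconds. A port waits this long
# in the listening state and again in the learning state before forwarding.
DEFAULT_FORWARD_DELAY_S = 15

# The DHCP timeout quoted in the article, in seconds.
DHCP_TIMEOUT_S = 30


def seconds_until_forwarding(forward_delay_s=DEFAULT_FORWARD_DELAY_S,
                             link_up_s=0.0, portfast=False):
    """Estimate how long a port takes to start forwarding after link-up.

    link_up_s is a hypothetical allowance for link negotiation. With
    portfast, the listening and learning waits are skipped.
    """
    if portfast:
        return link_up_s
    return link_up_s + 2 * forward_delay_s


def lease_possible(port_delay_s, dhcp_timeout_s=DHCP_TIMEOUT_S):
    """True if the port is forwarding before the DHCP client gives up."""
    return port_delay_s < dhcp_timeout_s


if __name__ == "__main__":
    for portfast in (False, True):
        delay = seconds_until_forwarding(link_up_s=2.0, portfast=portfast)
        print(f"portfast={portfast}: {delay:.0f}s, lease possible: {lease_possible(delay)}")
```

With default timers the two waits alone add up to the full DHCP timeout, so any extra delay tips the install into failure, while a portfast port leaves plenty of margin.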
[Bill: One of the guys I worked with right out of college always told me “Start your troubleshooting with the cabling.” When troubleshooting networking issues, it's easy to forget about things that can affect the link-layer, so I check those as part of the cabling now. It doesn't take long and can save tons of time.]
I started reviewing my options. I needed some way to take the switch out of the equation. In all of my planning for this trip, I happened to bring quite a toolkit of MacGyver sysadmin gear, including a short handmade crossover cable and a coupler. I needed to keep the original kickstart server on the network, but I realized if I could clone all of the kickstart configurations, DHCP settings and package repositories to my laptop, I could connect to the machine with a crossover cable and complete the kickstart that way.
After a few apt-gets, rsyncs, and some tweaking and tuning on the server room floor, I had my Frankenstein kickstart server ready to go. As I had hoped, the kickstart completed without a hitch. I was then able to repeat the same task on the other two servers in no time and was relieved to send the e-mail to the rest of the team saying that all of their servers were ready for them, right on schedule. On the next day of the trip, I was able to knock out all of my tasks early so I could spend the final provisional day sightseeing around London. It all goes to show that although a good plan is important, you also should be flexible for when something inevitably goes outside your plan.
[Bill: I'm glad you planned like you did, but it also highlights how important being observant and having a good troubleshooting methodology are. Although you were able to duct-tape a new kickstart server out of your laptop, you could have spent infinitely longer chasing the issue. It's just as important to know when to stop chasing a problem and put a band-aid in place as it is to fix the problem in the first place.]
Kyle Rankin is a Systems Architect in the San Francisco Bay Area and the author of a number of books, including The Official Ubuntu Server Book, Knoppix Hacks and Ubuntu Hacks. He is currently the president of the North Bay Linux Users' Group.
Bill Childers is an IT Manager in Silicon Valley, where he lives with his wife and two children. He enjoys Linux far too much, and he probably should get more sun from time to time. In his spare time, he does work with the Gilroy Garlic Festival, but he does not smell like garlic.