Tales From the Server Room: Zoning Out
Sometimes events and equipment conspire against you and your team to cause a problem. Occasionally, however, it's lack of understanding or foresight that can turn around and bite you. Unfortunately, this is a tale of how we failed to spot all the things that might go wrong.
Flashback...
It was 2006, and we were just getting our feet wet with piloting a new server architecture for our company. We'd just received our first fully populated Hewlett-Packard blade chassis (a P-Class chassis with eight dual-core blades, for those of you who're savvy with that type of gear), a new EMC Storage Area Network (SAN) and three VMware ESX licenses. We had just finished converting a fair amount of the development network over to the VMware environment using a Physical-to-Virtual (P2V) migration, and things were going quite well. As a matter of fact, many people in the company didn't quite understand the improvements we were making to the systems, but they did notice the performance boost of going from something like single-processor Pentium 4-class servers with IDE disks to dual-core Opterons with storage backed by the speed of the Fibre Channel SAN. The feedback we'd received to date fueled a rather rapid switch from the aging physical architecture to a much faster virtual machine architecture.
Background
Before we dive into the story, a couple bits of background information will become very important later on. As I said, we'd received eight dual-core blades, but only three of them at that time were set aside for VMware hosts. The rest were slated to become powerful physical machines (Oracle servers and the like). All these new blades were configured identically: they each had 16GB of RAM, two dual-core Opteron processors, two 300GB disks and Fibre Channel cards connected to the shiny new EMC SAN. With respect to the SAN, since we were devoting it strictly to the blade servers, the decision was made not to add the complexity of zoning the SAN switch. (Zoning a SAN switch restricts which hosts can talk to which storage ports, so that, together with LUN masking on the array, only certain hosts can access certain disks.) The last tidbit relates to kickstart.
Both Kyle and I have written a few articles on the topic of kickstarting and automated installation, so by now you're probably aware that we're fans of that. However, this was 2006, and we both were getting our feet wet with that technology. We'd inherited a half-set-up kickstart server from the previous IT administration, and we slowly were making adjustments to it as we grew more knowledgeable about the tech and what we wanted it to do.
[Kyle: Yes, the kickstart environment technically worked, but it required that you physically walk up to each machine with a Red Hat install CD, boot from it, and manually type in the full HTTP path to the kickstart file. I liked the idea of kicking a machine without getting up from our desks, so the environment quickly changed to PXE booting among a number of other improvements. That was convenient, because those blades didn't have a CD-ROM drive.]
Getting back to the story...we'd moved a fair amount of the development and corporate infrastructure over to the VMware environment, but we still had a demand for high-powered physical machines. We'd gotten a request for a new Oracle database machine, and since they were the most powerful boxes in the company at the time, with connections to the Storage Area Network, we elected to make one of the new blades an Oracle box.
As my imperfect memory recalls, Kyle fired up the lights-out management on what was to be the new Oracle machine and started the kickstart process, while I was doing something else—it could have been anything from surfing Slashdot to filling out some stupid management paperwork. I don't remember, and it's not critical to the story, as about 20 minutes after Kyle kickstarted the new Oracle blade, both of our BlackBerries started beeping incessantly.
[Kyle: Those of you who worked (or lived) with us during that period might say, "Weren't your BlackBerries always beeping incessantly?" Yes, that's true, but this time it was different: one, we were awake, and two, we actually were in the office.]
Trouble in Paradise
We both looked at our BlackBerries as we started getting "host down" alerts from most of the machines in the development environment. About that time, muttering could be heard from other cubicles, too: "Is the network down? Hey, I can't get anywhere." I started getting that sinking feeling in the pit of my stomach as Kyle and I started digging into the issue.
Sure enough, as we started looking, we realized just about everything was down. Kyle fired up the VMware console and tried restarting a couple virtual machines, but his efforts were met with "file not found" errors from the console upon restart. File not found? That sinking feeling just accelerated into free-fall. I started looking along with Kyle and realized that all the LUNs (disks where the virtual machines reside) just flat out stopped being available to each VM host.
[Kyle: It's hard to describe the sinking feeling. I was relatively new to SAN at the time and was just realizing how broad a subject it is in its own right. SAN troubleshooting at a deep level was not something I felt ready for so soon, yet it looked like unless we could figure something out, we had a large number of servers that were gone for good.]
I jumped on the phone and called VMware while Kyle continued troubleshooting. After a few minutes on the line, the problem was apparent. The LUNs containing the virtual machines had their partition tables wiped out. We luckily could re-create them, and after a quick reboot of each VM host, we were back in business, although we were very worried and confused about the issue.
[Kyle: So that's why that sinking feeling felt familiar. It was the same one I had the first time I accidentally nuked the partition table on my own computer with a bad dd command.]
Our worry jumped to near-panic, however, when the issue reared its head a second time under similar circumstances. A physical machine kickstart wound up nuking the partition table on the SAN LUNs that carried the virtual machine files. We placed another call to VMware, and after some log mining, they determined that it wasn't a bug in their software, but something on our end that was erasing the partition table.
A Light Dawns
Kyle and I started to piece together the chain of events and realized that each time this occurred, it was preceded by a kickstart of a blade server. That led us to look at the actual kickstart control file we were using, and it turned out there was one line in there that caused the whole problem. The directive clearpart --all --initlabel erases the partition table on all disks attached to a particular host. That makes sense if the server in question has only local disks, but these blades were attached to the SAN, and the SAN didn't have any zoning in place to protect against this. As it turns out, the system did exactly what it was set up to do. If we had placed the LUNs in zones, this wouldn't have happened, and if we had audited the kickstart control file and thought about it in advance, the problem wouldn't have happened either.
[Kyle: Who would have thought that kickstart would become yet another one of those UNIX genie-like commands, like dd, that do exactly what you say? We not only placed the LUNs in zones, but we also made sure that the clearpart directive was very specific to clear out only the disks we wanted. Lucky for us, those HP RAID controllers show up as /dev/cciss/ devices, so it was easy to write the restriction.]
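The kind of audit we wished we had done can be sketched in a few lines of Python. This is only an illustration, not the check we actually ran: the function name risky_clearpart_lines is made up for this example, and the drive name cciss/c0d0 is an assumption about how the first logical drive on one of those HP controllers would be named. The sketch flags any clearpart line that uses --all without a --drives restriction, which is exactly the combination that bit us.

```python
def risky_clearpart_lines(kickstart_text):
    """Return (line_number, line) pairs for clearpart directives that
    use --all without limiting themselves to named drives."""
    findings = []
    for number, raw in enumerate(kickstart_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        tokens = line.split()
        if tokens[0] != "clearpart":
            continue
        wipes_all = "--all" in tokens
        # Accept both "--drives=cciss/c0d0" and "--drives cciss/c0d0".
        limited = any(t == "--drives" or t.startswith("--drives=")
                      for t in tokens)
        if wipes_all and not limited:
            findings.append((number, line))
    return findings


# Hypothetical kickstart snippets: the first is the line from our story,
# the second restricts the wipe to an assumed local cciss drive.
UNSAFE = "clearpart --all --initlabel"
SAFER = "clearpart --all --initlabel --drives=cciss/c0d0"

if __name__ == "__main__":
    for number, line in risky_clearpart_lines(UNSAFE + "\n" + SAFER):
        print("line %d wipes every visible disk: %s" % (number, line))
```

A check like this is no substitute for zoning, but it is cheap to run against every kickstart file before a host that can see shared storage gets rebuilt.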
Lessons Learned
We learned a couple things that day. First was the importance of zoning your SAN correctly. The assumption we were operating under, that these boxes would all want to access the SAN and, therefore, zones were unnecessary, was flat out wrong. Second was the importance of auditing and understanding the work that other sysadmins had done before us, and how that work would affect the new stuff we were implementing. Needless to say, our SAN always was zoned properly after that.
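To make the zoning lesson concrete, here is a toy model in Python. It is a deliberate simplification: real zoning works on switch ports and WWNs, and per-LUN access is usually enforced by LUN masking on the array, so this sketch folds both into one lookup. The host names, LUN names and the visible_luns function are all hypothetical.

```python
def visible_luns(host, luns, zones=None):
    """Return the set of LUNs a host can reach.

    zones maps a zone name to a (hosts, luns) pair. With no zones
    defined, every host sees every LUN, as in the story above.
    """
    if not zones:
        return set(luns)
    reachable = set()
    for zone_hosts, zone_luns in zones.values():
        if host in zone_hosts:
            # Only count LUNs that actually exist on the SAN.
            reachable |= set(zone_luns) & set(luns)
    return reachable


# Hypothetical inventory: two LUNs for the VMware hosts, one for Oracle.
LUNS = ["vmfs_lun0", "vmfs_lun1", "oracle_lun0"]
ZONES = {
    "vmware": ({"esx1", "esx2", "esx3"}, {"vmfs_lun0", "vmfs_lun1"}),
    "oracle": ({"oracle1"}, {"oracle_lun0"}),
}

if __name__ == "__main__":
    print("unzoned:", sorted(visible_luns("oracle1", LUNS)))
    print("zoned:  ", sorted(visible_luns("oracle1", LUNS, ZONES)))
```

In the unzoned case, the blade being kickstarted can reach the VMware LUNs, so a directive that clears all attached disks clears those too. With zones in place, the same directive can touch only the disks that blade is meant to own.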
Bill Childers is the Virtual Editor for Linux Journal. No one really knows what that means.
Comments
Been there, done that.
I had a similar setup around the same time you did, with an HP P-Class blade system and an HP EVA 4000. I wasn't blowing away my VMware LUN partition tables, but I was fighting performance issues, in particular a LUN operation like a rescan on one blade causing LUN errors on the other 8 hosts. I think this was in ESX 2.5. Final lesson: zone each host to the SAN independently on each switch.