Tales From the Server Room: Zoning Out

Sometimes events and equipment conspire against you and your team to cause a problem. Occasionally, however, it's lack of understanding or foresight that can turn around and bite you. Unfortunately, this is a tale of how we failed to foresee everything that could go wrong.

Flashback...

It was 2006, and we were just getting our feet wet with piloting a new server architecture for our company. We'd just received our first fully populated Hewlett-Packard blade chassis (a P-Class chassis with eight dual-core blades, for those of you who're savvy with that type of gear), a new EMC Storage Area Network (SAN) and three VMware ESX licenses. We had just finished converting a fair amount of the development network over to the VMware environment using a Physical-to-Virtual (P2V) migration, and things were going quite well. As a matter of fact, many of the people in the company didn't understand exactly what improvements we were making to the systems, but they did notice the performance boost of going from machines that were roughly single-processor Pentium 4-class servers with IDE disks to dual-core Opterons with storage backed by the Fibre Channel SAN. The feedback we'd received to date fueled a rather rapid switch from the aging physical architecture to a much faster virtual machine architecture.

Background

Before we dive into the story, a couple bits of background information will become very important later on. As I said, we'd received eight dual-core blades, but only three of them at that time were set aside for VMware hosts. The rest were slated to become powerful physical machines—Oracle servers and the like. All these new blades were configured identically: they each had 16GB of RAM, two dual-core Opteron processors, two 300GB disks and Fibre Channel cards connected to the shiny new EMC SAN. With respect to the SAN, since we were devoting this SAN strictly to the blade servers, the decision was made not to add the complexity of zoning the SAN switch. (Zoning a SAN switch means that it is set up to allow only certain hosts to access certain disks.) The last tidbit relates to kickstart.

Both Kyle and I have written a few articles on the topic of kickstarting and automated installation, so by now you're probably aware that we're fans of that. However, this was 2006, and we both were getting our feet wet with that technology. We'd inherited a half-set-up kickstart server from the previous IT administration, and we slowly were making adjustments to it as we grew more knowledgeable about the tech and what we wanted it to do.

[Kyle: Yes, the kickstart environment technically worked, but it required that you physically walk up to each machine with a Red Hat install CD, boot from it, and manually type in the full HTTP path to the kickstart file. I liked the idea of kicking a machine without getting up from our desks, so the environment quickly changed to PXE booting, among a number of other improvements. That was convenient, because those blades didn't have a CD-ROM drive.]

Getting back to the story...we'd moved a fair amount of the development and corporate infrastructure over to the VMware environment, but we still had a demand for high-powered physical machines. We'd gotten a request for a new Oracle database machine, and since the blades were the most powerful boxes in the company at the time, with connections to the Storage Area Network, we elected to make one of the new blades an Oracle box.

As my imperfect memory recalls, Kyle fired up the lights-out management on what was to be the new Oracle machine and started the kickstart process, while I was doing something else—it could have been anything from surfing Slashdot to filling out some stupid management paperwork. I don't remember, and it's not critical to the story, as about 20 minutes after Kyle kickstarted the new Oracle blade, both of our BlackBerries started beeping incessantly.

[Kyle: Those of you who worked (or lived) with us during that period might say, "Weren't your BlackBerries always beeping incessantly?" Yes, that's true, but this time it was different: one, we were awake, and two, we actually were in the office.]

Trouble in Paradise

We both looked at our BlackBerries as we started getting "host down" alerts from most of the machines in the development environment. About that time, muttering could be heard from other cubicles, too: "Is the network down? Hey, I can't get anywhere." I started getting that sinking feeling in the pit of my stomach as Kyle and I started digging into the issue.

Sure enough, as we started looking, we realized just about everything was down. Kyle fired up the VMware console and tried restarting a couple virtual machines, but his efforts were met with "file not found" errors from the console upon restart. File not found? That sinking feeling just accelerated into free-fall. I started looking along with Kyle and realized that all the LUNs (disks where the virtual machines reside) just flat out stopped being available to each VM host.

[Kyle: It's hard to describe the sinking feeling. I was relatively new to SAN at the time and was just realizing how broad a subject it is in its own right. SAN troubleshooting at a deep level was not something I felt ready for so soon, yet it looked like unless we could figure something out, we had a large number of servers that were gone for good.]

I jumped on the phone and called VMware while Kyle continued troubleshooting. After a few minutes on the line, the problem was apparent. The LUNs containing the virtual machines had their partition tables wiped out. We luckily could re-create them, and after a quick reboot of each VM host, we were back in business, although we were very worried and confused about the issue.

[Kyle: So that's why that sinking feeling felt familiar. It was the same one I had the first time I accidentally nuked the partition table on my own computer with a bad dd command.]

Our worry jumped to near-panic, however, when the issue reared its head a second time under similar circumstances. A physical machine kickstart wound up nuking the partition table on the SAN LUNs that carried the virtual machine files. We placed another call to VMware, and after some log mining, they determined that it wasn't a bug in their software, but something on our end that was erasing the partition table.

A Light Dawns

Kyle and I started to piece together the chain of events and realized that each time this occurred, it was preceded by a kickstart of a blade server. That led us to look at the actual kickstart control file we were using, and it turned out there was one line in there that caused the whole problem. The directive clearpart --all --initlabel erases the partition table on every disk the host being installed can see. That makes sense if the server in question has only local disks, but these blades were attached to the SAN, and the SAN didn't have any zoning in place to keep the installer away from the shared LUNs. As it turns out, the system did exactly what it was set up to do. If we had placed the LUNs in zones, this wouldn't have happened, and if we'd audited the kickstart control file and thought about it in advance, the problem wouldn't have happened either.

[Kyle: Who would have thought that kickstart would become yet another one of those UNIX genie-like commands, like dd, that do exactly what you say? We not only placed the LUNs in zones, but we also made sure that the clearpart directive was very specific to clear out only the disks we wanted. Lucky for us, those HP RAID controllers show up as /dev/cciss/ devices, so it was easy to write the restriction.]
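The kind of audit described above can be approximated with a short script. What follows is a minimal sketch in Python, not the check that was actually used: it scans kickstart text for a clearpart directive that combines --all with no --drives restriction. The function name, the sample kickstart snippets and the cciss/c0d0 device name are all hypothetical, chosen only for illustration, and the sketch looks at directives line by line without trying to understand %pre or %post script sections.

```python
def find_unrestricted_clearpart(kickstart_text):
    """Return (line_number, line) pairs for clearpart directives that
    use --all without a --drives restriction, meaning they would wipe
    every disk the installer can see."""
    findings = []
    for number, raw in enumerate(kickstart_text.splitlines(), start=1):
        line = raw.strip()
        # Skip blank lines and kickstart comments.
        if not line or line.startswith("#"):
            continue
        tokens = line.split()
        if tokens[0] != "clearpart":
            continue
        wipes_all = "--all" in tokens
        restricted = any(
            t == "--drives" or t.startswith("--drives=") for t in tokens
        )
        if wipes_all and not restricted:
            findings.append((number, line))
    return findings


# Hypothetical kickstart fragments, for illustration only.
risky = """\
# wipes every visible disk, SAN LUNs included
clearpart --all --initlabel
"""

safer = """\
# limited to the local RAID volume (device name is an assumption)
clearpart --all --initlabel --drives=cciss/c0d0
"""

print(find_unrestricted_clearpart(risky))
print(find_unrestricted_clearpart(safer))
```

A check like this is no substitute for zoning. It only catches one known-dangerous pattern in the control file, while zoning keeps a misbehaving host from reaching the LUNs at all.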

Lessons Learned

We learned a couple things that day. First was the importance of zoning your SAN correctly. The assumption we were operating under (that these boxes would all want to access the SAN and, therefore, zones were unnecessary) was flat out wrong. Second was the importance of auditing the work other sysadmins had done before us and understanding how that work would affect the new stuff we were implementing. Needless to say, our SAN always was zoned properly after that.

______________________

Bill Childers is the Virtual Editor for Linux Journal. No one really knows what that means.

Comments

James Randall

Such an occurrence happens far too often in companies that rely heavily on IT and running servers. You never know when a computer or human error will cause everything to just shut down. Other than planning ahead for such problems, you also need a good data recovery system to prevent yourself from permanently losing everything you worked for.

James - http://www.raid-data-recovery-uk.com

Been there, done that.

Jason

I had a similar setup around the same time you did, with an HP P-class blade system and an HP EVA 4000. I wasn't blowing away my VMware LUN partition tables, but I was fighting performance issues, in particular a LUN operation like a rescan on one blade causing LUN errors on the other 8 hosts. I think this was in ESX 2.5. Final lesson: zone each host to the SAN independently on each switch.
