It's Always DNS's Fault!

It's always better to learn from someone else's mistakes than from your own. In this column, Kyle Rankin or Bill Childers tells a story from his years as a systems administrator, and the other chimes in from time to time. It's a win-win: you get to learn from their experiences, and they get to make snide comments to each other. Today's episode is narrated by Bill.

Some Days, You're the Pigeon...

I was suffering, badly. We had just finished an all-night switch migration on our production Storage Area Network while I was hacking up a lung fighting walking pneumonia. Even though I did my part of the all-nighter from home, I was exhausted. So when my pager went off at 9am that morning, allowing me a mere four hours of sleep, I was treading dangerously close to zombie territory.

I looked at the pager and saw that someone had pushed the dreaded "Panic Button", a Web-based tool we'd made that would alert the larger IT team to an unknown high-priority issue. I sat up, reeling, and asked my wife to begin the caffeine IV drip that would wake me up while I slowly started banging synapses together, hoping for a spark. According to the report, our DNS infrastructure was timing out on a lot of requests, causing overall site slowdown. I had to re-read that e-mail several times for it to sink into my oxygen- and sleep-deprived brain. How could DNS be timing out, and why hadn't our internal monitoring caught it? We monitored the DNS servers and service levels internally, and if performance was bad, I should have been the first to know. Something smelled really funny, and it wasn't me, despite the pneumonia-induced fever.

[Kyle: I'll pretend I didn't see the "something smelled funny" comment, as it's too easy. The funny thing here was that we had a long-standing tradition of DNS being blamed whenever there was any sort of networking problem. I've said before that people tend to blame the technology they understand least. This case was one of the first times that it actually seemed (at least on the surface) to be a DNS issue.]

I started checking on things as I dialed in to the conference call for this issue. Our monitoring system said nothing was awry, and response times for DNS were normal. I ran a few nslookups past the DNS server, and it replied in its usual speedy fashion with the expected result. I flipped through the logs as well, and they showed nothing out of the ordinary. What was going on?
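
If you want to script that kind of spot check instead of running nslookup by hand, the listing below times an A-record lookup against one specific DNS server. It's a minimal sketch, assuming the third-party dnspython package is installed; the server address 10.0.0.53 and the hostname are hypothetical stand-ins, not values from our environment.

    # Time an A-record lookup against a single, explicitly chosen DNS server,
    # roughly what "nslookup www.ourdomain.com 10.0.0.53" checks by hand.
    import time

    import dns.resolver  # third-party: pip install dnspython (2.x API)

    def timed_lookup(hostname, server_ip, timeout=5.0):
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [server_ip]  # query only this server
        resolver.timeout = timeout          # per-attempt timeout
        resolver.lifetime = timeout         # total time budget for the query
        start = time.monotonic()
        answer = resolver.resolve(hostname, "A")
        return [rr.address for rr in answer], time.monotonic() - start

    addresses, elapsed = timed_lookup("www.ourdomain.com", "10.0.0.53")
    print(f"{addresses} in {elapsed * 1000:.1f} ms")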

At this point, I probably should describe how the company's DNS infrastructure was set up. The company had two main data centers: A and B. Each data center had a load-balanced pair of DNS servers set up as active-passive, and the public virtual IP address of each pair was published as an NS record for every domain we serviced. That split the DNS load for any set of requests roughly evenly between the two data centers, and because each data center had a load-balanced pair behind its virtual IP, we could tolerate the failure of a single DNS server without any degradation in customer-facing service.

[Kyle: The beauty of a system like this is that although DNS gives you automatic failover when you publish more than one NS record, if one of those servers is down, clients generally have to wait about 30 seconds for the query to time out before trying the next server. That 30-second delay was too long for our needs, so with this design, we could take down any individual DNS server and the load balancer would just forward requests to the remaining server in the data center.]
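
To see what the outside world actually sees of a setup like this, the sketch below lists the NS records published for a domain and resolves each NS name to its address, which in this design would be the load-balancer VIP fronting each data center's DNS pair. Again it's a minimal sketch assuming dnspython, and ourdomain.com is just a placeholder name.

    # List the NS records published for a domain and resolve each NS name,
    # which in this design resolves to the load-balancer VIP in front of
    # each data center's active-passive DNS pair.
    import dns.resolver  # third-party: pip install dnspython

    def nameserver_addresses(domain):
        ns_names = [rr.target.to_text() for rr in dns.resolver.resolve(domain, "NS")]
        return {name: [rr.address for rr in dns.resolver.resolve(name, "A")]
                for name in ns_names}

    for name, addresses in nameserver_addresses("ourdomain.com").items():
        print(name, addresses)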

Anyway, I continued troubleshooting. On a hunch, I started running nslookups against a few domains from my home, in case the problem was visible only from the outside. Oddly enough, the nslookups succeeded for the most part, except for lookups against one of our most active sites, which used Akamai as a content delivery network (CDN). Akamai requires that you configure your DNS using CNAME, or alias, records, so that its CDN can spider and cache your content. The CNAME records look something like the following (a short sketch of walking this chain appears after the list):

  • A CNAME pointing www.ourdomain.com to ourdomain.com.edgesuite.net.

  • A CNAME pointing origin-www.ourdomain.com to ourdomain.com.
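
The sketch below walks that CNAME chain starting from the www name, following each alias until it reaches a name with no further CNAME. As before, it's a minimal sketch assuming dnspython; the hostnames are the same placeholders used in the list above.

    # Follow the CNAME chain for an Akamai-fronted hostname, e.g.
    # www.ourdomain.com -> ourdomain.com.edgesuite.net -> ... until a name
    # with no further CNAME (which should then carry the A records).
    import dns.resolver  # third-party: pip install dnspython

    def follow_cnames(hostname, max_hops=10):
        chain = [hostname]
        for _ in range(max_hops):
            try:
                answer = dns.resolver.resolve(chain[-1], "CNAME")
            except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
                break  # end of the alias chain
            chain.append(next(iter(answer)).target.to_text())
        return chain

    print(" -> ".join(follow_cnames("www.ourdomain.com")))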

Sure enough, external requests that hit data center A would time out and wind up failing over to data center B. Typical DNS timeouts put this on the order of 30 seconds, which is unacceptable for any kind of commercial Web site. Since the main site I found that was affected was served through Akamai, I made the call to Akamai support for assistance.
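
A check along these lines, run from outside the corporate network, is what made the pattern visible: time the same lookup against each data center's published nameserver and flag any query that never comes back within the usual 30-second budget. The sketch below assumes dnspython; the two VIP addresses are documentation-range placeholders standing in for data centers A and B, not our real addresses.

    # From an external vantage point, time the same lookup against each data
    # center's public DNS VIP and flag anything that hits the ~30-second
    # stub-resolver timeout described above.
    import time

    import dns.exception
    import dns.resolver  # third-party: pip install dnspython

    DATACENTER_VIPS = {"data center A": "198.51.100.53",
                       "data center B": "203.0.113.53"}

    def timed_check(hostname, server_ip, budget=30.0):
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [server_ip]
        resolver.lifetime = budget  # give up after the full 30-second budget
        start = time.monotonic()
        try:
            resolver.resolve(hostname, "A")
            return time.monotonic() - start
        except dns.exception.Timeout:
            return None  # the query never came back

    for dc, vip in DATACENTER_VIPS.items():
        elapsed = timed_check("www.ourdomain.com", vip)
        print(dc, "TIMED OUT" if elapsed is None else f"answered in {elapsed:.2f}s")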

[Kyle: I can't count how many times I've used personal servers that are colocated outside a corporate network to troubleshoot problems. It can be invaluable to have a perspective on the health of a network service that's completely detached from your corporate network. Think of it as another reason to keep your home server on 24/7.]

______________________

Kyle Rankin is SVP of Security and Infrastructure at Zero, the author of many books including Linux Hardening in Hostile Networks, DevOps Troubleshooting and The Official Ubuntu Server Book, and a columnist for Linux Journal. Follow him @kylerankin