It's Always DNS's Fault!
It's always better to learn from someone else's mistakes than from your own. In this column, Kyle Rankin or Bill Childers tells a story from his years as a systems administrator, and the other chimes in from time to time. It's a win-win: you get to learn from their experiences, and they get to make snide comments to each other. Today's episode is narrated by Bill.
Some Days, You're the Pigeon...
I was suffering, badly. We had just finished an all-night switch migration on our production Storage Area Network while I was hacking up a lung fighting walking pneumonia. Even though I did my part of the all-nighter from home, I was exhausted. So when my pager went off at 9am that morning, allowing me a mere four hours of sleep, I was treading dangerously close to zombie territory.
I looked at the pager and saw that someone had pushed the dreaded "Panic Button", a Web-based tool we'd made that would alert the larger IT team to an unknown high-priority issue. I sat up, reeling and asked my wife to begin the caffeine IV drip that would wake me up while I slowly started banging synapses together, hoping for a spark. According to the report, our DNS infrastructure was timing out on a lot of requests, causing overall site slowdown. I had to re-read that e-mail several times for it to sink into my oxygen-and-sleep-deprived brain. How could DNS be timing out, and why hasn't our internal monitoring caught that? We monitored the DNS servers and service levels internally, and if performance was bad, I should have been the first to know. Something smelled really funny, and it wasn't me, despite the pneumonia-induced fever.
[Kyle: I'll pretend I didn't see the "something smelled funny" comment, as it's too easy. The funny thing here was that we had a long-standing tradition of DNS being blamed whenever there was any sort of networking problem. I've said before that people tend to blame the technology they understand least. This case was one of the first times that it actually seemed (at least on the surface) to be a DNS issue.]
I started checking on things as I dialed in to the conference call for this issue. Our monitoring system said nothing was awry, and response times for DNS were normal. I ran a few nslookups past the DNS server, and it replied in its usual speedy fashion with the expected result. I flipped through the logs as well, and they showed nothing out of the ordinary. What was going on?
At this point, I probably should describe how the company's DNS infrastructure was set up. It had two main data centers: A and B. Each data center had a load-balanced pair of DNS servers set up as active-passive, and the public virtual IP addresses for each were published as the NS records for each domain we serviced. That would cause each data center to service half the DNS load for any set of requests, and due to each data center having a load-balanced pair of DNS servers, we could tolerate a failure of a DNS server without any degradation in customer-facing service.
[Kyle: The beauty of a system like this is that even though DNS has automatic failover if you have more than one NS record, if a DNS server is down, you generally have to wait the 30 seconds for it to time out. That 30-second delay was too long for our needs, so with this design, we could take down any individual DNS server and the load balancer would just forward requests to the remaining server in the data center.]
Anyway, I continued troubleshooting. On a hunch, I started running nslookups against a few domains from my home—maybe the problem was visible only from the outside. Oddly enough, nslookups succeeded for the most part, except for those pointing to one of our most active sites, which had Akamai as a content delivery network (CDN). Akamai requires that you configure your DNS using CNAME, or alias, records, so that its CDN can spider and cache your content. The CNAME records look something like the following:
A CNAME pointing www.ourdomain.com to ourdomain.com.edgesuite.net.
A CNAME pointing origin-www.ourdomain.com to ourdomain.com.
Surely enough, external requests that hit data-center A would time out and wind up failing over to data-center B. Typical DNS timeouts put this on the order of 30 seconds, which is unacceptable for any kind of commercial Web site. Since Akamai was involved, and the main site I found that was affected was utilizing Akamai, I made the call to Akamai support for assistance.
[Kyle: I can't count how many times I've used personal servers that are colocated outside a corporate network to troubleshoot problems. It can be invaluable to have a perspective on the health of a network service that's completely detached from your corporate network. Think of it as another reason to keep your home server on 24/7.]
...and Some Days You're the Statue.
At this point, I had the Akamai folks looking at the issue from their end, and after a couple hours of back-and-forth troubleshooting, they announced that the problem was not with their setup, and it had to be something in our DNS servers. However, all the tests I did within the data center returned correctly and instantly. It was only tests from the outside that timed out and failed to data-center B, and even those were sporadic. By this time it was after noon, and even the caffeine IV drip was wearing off. I was tired, sick, and my brain was not firing on all cylinders.
It was at about this time that I started getting e-mail messages from my pointy-headed boss about how I was "not doing anything" to fix the problem, and he wanted status updates every half-hour on how things were progressing. I replied that I either could update him every half-hour or work on the problem like I had been.
That e-mail message made my cell phone ring, instantly. My boss was on the line, demanding I reboot the DNS server in an attempt to "fix" the problem. However, without any signs of anything wrong on the DNS server other than the failed queries, I was reluctant just to bounce the server, because if the error condition cleared, we'd have no further way to collect information about the problem.
Kyle wound up finding an odd bug in an older version of BIND where a counter rollover caused weird things to happen. That bug supposedly was fixed in the version we were running, but it was a lead to go on, so I made the call, reluctantly, to restart the DNS service on the primary DNS server. Much to my surprise, once we did that, the timeouts stopped occurring. All of a sudden, our DNS infrastructure was back up to 100%, and site performance returned to its normal levels.
[Kyle: It's worth noting that the DNS process on this system had been up and stable for more than a year. Although it was technically possible that an internal uptime counter rollover (like the 498-day uptime rollover on older Linux kernels) could cause strange behavior, it really seemed like grasping at straws, and I was surprised when it seemed to fix the problem. That, of course, brought up other questions though—were we going to have to bounce our DNS service every year?]
My boss called me after the site performance returned. To this day, I don't know if the call was to gloat about his "solution" being correct, or if he called to chastise me for waiting so long to restart the DNS server. I explained that although the issue was no longer occurring, we had zero information as to the root cause of the issue, and that it was not "fixed". Rebooting things randomly is what Windows admins do when things act up. UNIX system administrators tend to try to reach the heart of the issue. At the end of the call, it was apparent he didn't care about a fix. He simply wanted the site back to its normal performance. I finally passed out, exhausted, yet feeling worried that we had not seen the last of this issue.
Listen to Your Hunches
Fast-forward a couple weeks later. I'm feeling better, the pneumonia's defeated, and I'm back at work. True to my gut feeling, the issue spontaneously re-occurred again. People around the office started panicking, blaming the DNS infrastructure and flailing about in general. Kyle and I immediately set to troubleshooting again, but just like the time before, we couldn't find a single thing wrong with the DNS server. This time though, I was more on the ball, and I remembered that the DNS servers were fronted by a load balancer. On a hunch, I asked the network engineer if he had noticed any issues with the load balancer. He investigated and saw weirdness in the log files of the unit. After a little more conversation with him, he agreed there was a problem on the primary load balancer, and the decision was made to fail over to the backup load balancer. Once the failover happened, the DNS issue we were seeing cleared up again. All along, we were fighting a flaky load balancer, not an issue on the DNS server.
Several lessons came out of this issue. The biggest one is that it's easy to lose sight of all the technologies that come into play in a modern data center. My team was responsible for the UNIX systems, so we naturally tested and troubleshot the servers, but initially didn't think that the network possibly could be a problem. Always be sure to look outside your realm of responsibility, as the problem may lie there.
[Kyle: It's funny, because I've known people who default to looking outside their realm of responsibility when there is a problem. We should have been clued off when we noticed that internal DNS requests always worked while external ones (ones through the load balancer) were flaky. But like Bill said, once we rebooted the service and the problem disappeared, there wasn't any troubleshooting left to do.]
Another lesson was one I already knew, but it was highlighted that day. Rebooting a server that's misbehaving is an absolute last resort, as you'll wind up losing the problem in the first place, and you'll never figure out what the root cause is.
In all, although this issue did cost the company money that day due to the poor site performance, it was a good learning experience. I think of this incident a lot when designing new infrastructure or when faced with a new and unknown problem. It reminds me to be thorough when troubleshooting, to look at every possibility and not assume anything.