Complexity, Uptime and the End of the World

Poorly implemented monitoring systems can drive an administrator crazy. At best, they are distracting. At worst, they'll keep whoever is on pager duty up for nights at a time. This article discusses best practices for designing monitoring that keeps your services up and stays quiet when nothing is wrong.

After being in the computer industry for 20-odd years, I've come to realize there is a single thing everyone can agree on: no matter how new, how stable or how awesome any piece of technology is, it will break.

Fortunately, system administrators plan for these things. Whether it's a redundant server in the data center or a second availability zone in EC2, the first and best way to ensure uptime is to decrease the number of single points of failure across the network. There are drawbacks to this approach, though. Growing a Web cluster from one box to ten sharply decreases the chance that a single hardware failure takes down the entire site, but it also dramatically increases the expense and complexity of the network. Instead of a single server, there's now a series of boxes with a shared data store and load balancers, so it's ten times as likely that a hardware failure will occur somewhere and wake a system administrator, and that only counts the actual Web servers. Whether you're in a data center or in the cloud, this kind of layering of services significantly increases the chances that a single device will go down and alert in the middle of the night.

Preventing this kind of thing is usually high on a system administrator's list of desires, even if it tends to be pushed down the priority list in practice. Waking up in the middle of the night to fix a server or piece of software is bad for productivity and bad for morale. Two steps can help make sure this doesn't happen. The first is to implement the necessary amount of redundancy without increasing the complexity of the system past what is required for it to run. The second is to implement a monitoring system that watches exactly what you care about, rather than which individual box is using how much RAM.

The End of the World methodology is a thought experiment designed to help choose the level of redundancy and complexity required for an application. It helps determine acceptable scenarios for downtime. When you ask people when it's acceptable for their sites to be down, they'll often say that it never is, but that's not exactly true. If an asteroid strikes Earth and destroys most of the human race, is it necessary for the site to stay up? If the application is NORAD, maybe it is; for Groupon, not so much. That kind of uptime requires massive infrastructure placed in strategic locations around the globe, along with capital investments and staffing to which usually only large governments have access.

Backing off step by step from this kind of over-the-top disaster, you can find where the acceptable level is. What if the disaster is localized to just the continent? Is it acceptable to be down at this time? If the site is focused on those customers, it may be. If the site is an international tool, such as Amazon or Google, possibly not. What if it's local to the data center or availability zone where your boxes are kept? Most shops would like to stay up even if a backhoe cuts the power to their data center.

When the problem is framed this way, it becomes obvious that there is an acceptable level of downtime, and administrators can find the sweet spot between uptime and complexity. Finding the outer bounds of these requirements uncovers the requirements for monitoring the service as a whole. Notice that this is a service and not a server. Although it's easy to monitor whether a network interface is available, it's far more interesting to monitor the health of an entire cluster. In our ten-server cluster, if www6 goes down overnight on a cluster that sits at 40% utilization, it's probably not worth getting up for. If the entire Web service goes down, that needs to be acted upon immediately.
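
As a sketch of what a service-level check might look like (the host names and the threshold are hypothetical), the following script counts how many members of the Web cluster answer an HTTP request and alerts only when the surviving capacity drops below what the service needs:

#!/usr/bin/env ruby
# Hypothetical service-level check: alert on lost capacity,
# not on any single box.
# Call as:
# check_web_cluster.rb ${min_up}

require 'net/http'

min_up = ARGV[0].to_i
hosts = (1..10).map { |n| "www#{n}.example.com" }  # hypothetical names

up = hosts.count do |host|
  begin
    Net::HTTP.start(host, 80, open_timeout: 5, read_timeout: 5) do |http|
      http.get('/').is_a?(Net::HTTPSuccess)
    end
  rescue StandardError
    false
  end
end

if up >= min_up
  puts "OK|Status=0"
  exit(0)
else
  puts "Only #{up} of #{hosts.size} web servers up (need #{min_up})"
  exit(1)
end

Because the check reasons about the cluster as a whole, losing www6 overnight produces no page as long as enough capacity remains.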

A monitoring system is basically a scheduler and data-collection tool that executes checks against a service and reports the results back to be presented on a common dashboard. It seems like one of those innocuous pieces of software that just runs in the background, like network graphs or log analysis, but it has a hidden ability to hurt an entire engineering department. False positives can wake people up in the middle of the night and create an ongoing dread of pager duty. The result is that people put things in maintenance mode to quiet the false positives, which can end with failures of real services going unnoticed.
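
To make those moving parts concrete, here is a minimal sketch of the scheduler-and-collector core. This is not how Nagios or Zenoss is implemented, and the check commands listed are hypothetical; it simply runs each configured check on an interval and records the exit code and output string that would feed the dashboard:

#!/usr/bin/env ruby
# Toy scheduler/collector loop; the check commands are hypothetical
# scripts in the plugin format described later in this article.

require 'open3'

CHECKS = {
  'web-cluster'  => './check_web_cluster.rb 4',
  'hudson-build' => './check_hudson_job.rb mainline hudson.example.com'
}
INTERVAL = 60  # seconds between scheduling passes

loop do
  CHECKS.each do |name, command|
    output, status = Open3.capture2e(command)
    # Exit code 0 means OK; anything else is a failure whose output
    # string becomes the alert text on the dashboard.
    state = status.success? ? 'OK' : 'ALERT'
    puts "#{Time.now} #{name} #{state}: #{output.strip}"
  end
  sleep INTERVAL
end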

Dealing with false positives often is more of a policy problem than a design problem. Choosing what to monitor is far more important than choosing how to monitor it. Many companies have a history of monitoring things like CPU and RAM usage, on the theory that spikes are sometimes precursors to crashes, so alerting on them is reasonable. The problem is that many things cause a computer to use CPU and RAM, and most of them are within the normal bounds of an operating system. When the system administrator checks on the box, the resource is in use, but the application is functioning without a problem. Unless there is a clear, documented link between RAM above a certain level and a crashing service, skipping alerts on this kind of resource use leads to far fewer false positives. Monitors should be tied to a defined good or bad value with respect to a particular production service.
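
For example, instead of paging on RAM, a check can be tied to a value that is documented to matter for the service, such as whether production answers within an agreed bound. Here is a minimal sketch; the URL and the limit are assumptions for illustration:

#!/usr/bin/env ruby
# Hypothetical check tied to a documented service-level value
# (response time) rather than raw resource use.
# Call as:
# check_response_time.rb ${url} ${max_seconds}

require 'net/http'
require 'uri'

url = URI.parse(ARGV[0])
max_seconds = ARGV[1].to_f

started = Time.now
response = Net::HTTP.get_response(url)
elapsed = Time.now - started

if response.is_a?(Net::HTTPSuccess) && elapsed <= max_seconds
  puts "OK|Status=0"
  exit(0)
else
  puts "#{url} answered #{response.code} in #{'%.2f' % elapsed}s (limit #{max_seconds}s)"
  exit(1)
end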

Another path that leads to a large number of false positives is using percentages on differently equipped boxes. For example, if a system has a 137G drive that's 95% full, it has only around 6G free. On sites with heavy traffic, or sites with a lot of instrumentation in the code, 6G can go pretty quickly. Apply the same monitor to a Web server with a 2TB disk, and 95% full seems like much less of an emergency; leaving only 100G free on a system overnight is usually not a problem. If a particular box consumes an average of 5G of disk in a day of work, monitoring for 15G left and allowing alerts only during business hours gives three days' notice. Alerts that far ahead of time let the system administrator plan downtime for the system if it is required, so that the server can be maintained without taking the supported service down.
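
A sketch of a disk check along those lines, shelling out to df; the 15G floor, the mount point and the nine-to-five window are hypothetical:

#!/usr/bin/env ruby
# Hypothetical disk check: alert on absolute free space rather than
# a percentage, and only during business hours.
# Call as:
# check_disk_free.rb ${mountpoint} ${min_free_gigs}

mount = ARGV[0]
min_free_gigs = ARGV[1].to_f

# df -P prints one POSIX-format line per filesystem; the fourth
# column is available space in 1K blocks.
free_gigs = `df -P #{mount}`.lines.last.split[3].to_f / (1024 * 1024)

business_hours = (9..17).cover?(Time.now.hour)

if free_gigs >= min_free_gigs || !business_hours
  puts "OK|Status=0"
  exit(0)
else
  puts "#{mount} has #{'%.1f' % free_gigs}G free (floor #{min_free_gigs}G)"
  exit(1)
end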

The two most popular open-source monitoring systems are Zenoss and Nagios, and they offer similar monitoring capabilities. Zenoss provides more functionality and ease of use, incorporating basic auto-discovery of nodes, built-in RRD graphing, syslog management and the ability to deduplicate events. Nagios provides a larger community and a lighter install than Zenoss, which lets administrators plug in their own graphing solutions without duplicating software. The best part is that the two share a common format for monitoring scripts: the processes that do the actual checking of services.

Although both systems come with basic templates for monitoring HTTP ports and other popular services, much of their power comes from the ability to write custom scripts. This is a great way to check not only that a Web server is up, but also that the application itself is working. Below is an example of a script that monitors the success of Hudson jobs by calling Hudson's JSON API:


#!/usr/bin/env ruby
# Check the result of the last build of a Hudson job.
# Call as:
# check_hudson_job.rb ${jobname} ${hostname}

require 'rubygems'
require 'json'
require 'net/http'

jobname = ARGV[0]
hostname = ARGV[1]

# Ask Hudson's JSON API for the last build of the job.
url = URI.parse("http://#{hostname}/job/#{jobname}/lastBuild/api/json")
res = JSON.parse(Net::HTTP.get_response(url).body)
last_result = res["result"]

if last_result == "SUCCESS"
  # Success: exit 0 and hand the system a parsable status string.
  puts "OK|Status=0"
  exit(0)
else
  # Failure: fetch the job's health report and use its description
  # as the alert text.
  failurl = URI.parse("http://#{hostname}/job/#{jobname}/api/json")
  failres = JSON.parse(Net::HTTP.get_response(failurl).body)
  health = failres["healthReport"][0]["description"]
  puts "Job #{jobname} broke: #{health}"
  exit(1)
end

The monitoring system calls the script with the name of the job and the name of the host as command-line parameters. The script then asks the Hudson server for the result of the last build and checks it for success. The exit code and the output string are how the monitoring script replies to the monitoring system. A nonzero exit code indicates a failure, and the output is the string the system displays as the reason for the failure; on Zenoss, it is also used in deduplication. On success, the monitoring script exits 0 and returns a string in a special form for the system to process (see code).

Using this structure, system administrators can work with developers to build custom URLs that the monitoring system can access to determine the health of the application without worrying about every system in the set.
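
For example, if the developers expose a health URL that summarizes the application's own view of itself, the check stays trivial no matter how many systems sit behind it. The /health endpoint and the JSON it returns here are assumptions for illustration:

#!/usr/bin/env ruby
# Hypothetical application-health check against a developer-provided
# /health URL; the endpoint and JSON shape are assumptions.
# Call as:
# check_app_health.rb ${hostname}

require 'rubygems'
require 'json'
require 'net/http'

hostname = ARGV[0]

url = URI.parse("http://#{hostname}/health")
res = JSON.parse(Net::HTTP.get_response(url).body)

if res["status"] == "ok"
  puts "OK|Status=0"
  exit(0)
else
  puts "#{hostname} reports unhealthy: #{res.inspect}"
  exit(1)
end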

It may seem hard to swallow that it's acceptable to leave a box down overnight; it may be the first in a cascading series of failures that eventually takes down the whole service. But this can be addressed directly at the load balancer or front-end appliance instead of indirectly by looking at the boxes themselves. Using this method, the alert can be set to go off after a certain number of boxes fail at certain times of day, and there is no need to solve harder problems, such as requiring each box to know the state of the entire cluster.
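
One way to do this, assuming the front end is HAProxy with its statistics exported over HTTP in CSV form (the URL, back-end name and thresholds here are hypothetical), is to ask the load balancer how many back ends it considers up and vary the floor by time of day:

#!/usr/bin/env ruby
# Hypothetical load-balancer check: count back ends HAProxy reports
# as UP and require more of them during business hours.
# Call as:
# check_lb_backends.rb ${stats_csv_url} ${backend_name}

require 'net/http'
require 'uri'

url = URI.parse(ARGV[0])   # e.g. http://lb.example.com/haproxy?stats;csv
backend = ARGV[1]

csv = Net::HTTP.get_response(url).body.lines
header = csv.first.sub(/^# /, '').strip.split(',')
status_col = header.index('status')
svname_col = header.index('svname')
pxname_col = header.index('pxname')

# Keep only the real servers in our back end, not the summary rows.
rows = csv.drop(1).map { |l| l.strip.split(',') }
servers = rows.select do |r|
  r[pxname_col] == backend && !%w(FRONTEND BACKEND).include?(r[svname_col])
end
up = servers.count { |r| r[status_col] =~ /^UP/ }

# Require 8 of 10 during the day; 4 is enough to ride out the night.
floor = (9..17).cover?(Time.now.hour) ? 8 : 4

if up >= floor
  puts "OK|Status=0"
  exit(0)
else
  puts "#{backend}: only #{up} of #{servers.size} back ends up (floor #{floor})"
  exit(1)
end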

So far, the design has been fairly agnostic about geography and cloud footprint, and for most applications, this doesn't make a lot of difference. With multiple geographies, each data center usually has its own instance of the monitoring system, with each one monitoring its siblings in the other locations. Operating in the cloud offers greater flexibility. Although it still is necessary to monitor the monitoring system, this can be done easily using Amazon's own monitoring, which is far less configurable but perfectly capable of watching the Nagios or Zenoss EC2 instances.

What really stands out about Amazon's cloud is that it's elastic. Hooking the EC2 command-line tools up to the monitoring service allows new boxes to be launched when existing ones are suffering from resource starvation, load or crashing programs. Of course, this needs to be kept in check, or the number of instances could spiral out of control, but within reasonable bounds, launching new instances in place of crashing or overloaded ones from inside a monitoring script is relatively easy.

Here is an example of a script that monitors the load of a Hadoop cluster and adds more boxes as the number of jobs running increases:


#!/bin/bash
# Grow a Hadoop cluster when the job queue backs up.
# Call as:
# increase_amazon_set.sh ${threshold} ${AMI}

THRESHOLD=$1
AMI=$2

# The first line of 'hadoop job -list' reports the number of jobs
# currently running; grab that count.
NUM_JOBS=$(/opt/hadoop/current/bin/hadoop job -list | head -1 | awk '{print $1}')

if [[ $NUM_JOBS -gt $THRESHOLD ]] ; then
  echo "Warning: $NUM_JOBS running, increasing cluster size by 3"
  ec2-run-instances $AMI -n 3 --availability-zone us-east-1a
  exit 1
else
  echo "OK|Status=0"
  exit 0
fi

This follows the same format as the previous script, taking its parameters from the command line and replying to the monitoring system through the exit code and output string. The big difference is that the script isn't just spotting a problem and handing it to a system administrator to act on. It acts as an orchestrator, attempting to fix the problem it sees. Although care should be taken to place proper bounds on this behavior, so that the computer cannot run amok on the network, this kind of intelligent scheduler can be a powerful tool for automating tasks.
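
One such bound, as a sketch: count how many instances of the image are already running before allowing another launch. This assumes the classic EC2 API tools are on the path; the cap and the single-instance launch are hypothetical choices:

#!/usr/bin/env ruby
# Hypothetical guard around automated launches: never let the
# monitoring system grow the cluster past a hard cap.
# Call as:
# bounded_launch.rb ${AMI} ${cap}

ami = ARGV[0]
cap = ARGV[1].to_i

# ec2-describe-instances prints one INSTANCE line per instance;
# count those that reference our AMI and are in the running state.
running = `ec2-describe-instances`.lines.count do |line|
  line.start_with?('INSTANCE') && line.include?(ami) && line.include?('running')
end

if running >= cap
  puts "Refusing to launch: #{running} instances of #{ami} already running (cap #{cap})"
  exit(1)
else
  system("ec2-run-instances #{ami} -n 1 --availability-zone us-east-1a")
  puts "OK|Status=0"
  exit(0)
end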

Although setting up a new monitoring system from scratch with great alerting rules and intelligent orchestration is a fine idea, it's often just not possible. Most organizations already have a monitoring system in place, and often it's full of stale alerts and boxes that have been put in maintenance mode because they're more noisy than broken. If this is the case, it's time to cut out the cruft. Delete all the current alerts, and take everything out of maintenance mode that isn't actually undergoing maintenance. Take the ten noisiest, worst-behaved devices, and either stop monitoring the items that provoke false positives or rewrite the scripts so they provide more meaningful data. When those first ten are under control, move to the next group. It may take a few iterations over a few days, but in the end, you'll trust the messages coming from what can be a very powerful tool.

Monitoring systems often are dismissed as a necessary annoyance, but with a little bit of effort, they can be made to work for you. Monitoring services rather than servers, looking at clustered applications as a whole and alerting only on actionable errors provides real metrics for capacity planning and lets system administrators sleep through the night so that they can be more proactive during the day.

