Sysadmin 101: Alerting

Paging

Sending an alert often is referred to as paging or being paged, because in the past, sysadmins, like doctors, carried pagers on them. Their monitoring systems were set to send a basic numerical alert to the pager when there was a problem, so that sysadmins could be alerted even when they weren't at a computer or when they were asleep. Although we still refer to it as paging, and some older-school teams still pass around an actual pager, these days, notifications more often are handled by alerts to mobile phones.

The first question you need to answer when you set up alerting is what method you will use for notifications. When you are deciding how to set up pager notifications, look for a few specific qualities.

Something that will alert you wherever you are geographically.

A number of cool office projects on the web exist where a broken software build triggers a big red flashing light in the office. That kind of notification is fine for office-hour alerts for non-critical systems, but it isn't appropriate as a pager notification even during the day, because a sysadmin who is in a meeting room or at lunch would not be notified. These days, this generally means some kind of notification needs to be sent to your phone.

An alert should stand out from other notifications.

False alarms can be a big problem with paging systems, as sysadmins naturally will start ignoring alerts. Likewise, if you use the same ringtone for alerts that you use for any other email, your brain will start to tune alerts out. If you use email for alerts, use filtering rules so that on-call alerts generate a completely different and louder ringtone from regular emails and vibrate the phone as well, so you can be notified even if you silence your phone or are in a loud room. In the past, when BlackBerries were popular, you could set rules such that certain emails generated a "Level One" alert that was different from regular email notifications.

The BlackBerry days are gone now, and currently, many organizations (in particular startups) use Google Apps for their corporate email. The Gmail Android application lets you set per-folder (called labels) notification rules so you can create a filter that moves all on-call alerts to a particular folder and then set that folder so that it generates a unique alert, vibrates and does so for every new email to that folder. If you don't have that option, most email software that supports multiple accounts will let you set different notifications for each account so you may need to resort to a separate email account just for alerts.

Something that will wake you up all hours of the night.

Some sysadmins are deep sleepers, and whatever notification system you choose needs to be something that will wake them up in the middle of the night. After all, servers always seem to misbehave at around 3am. Pick a ringtone that is loud, possibly obnoxious if necessary, and also make sure to enable phone vibrations. Also configure your alert system to re-send notifications if an alert isn't acknowledged within a couple minutes. Sometimes the first alert isn't enough to wake people up completely, but it might move them from deep sleep to a lighter sleep so the follow-up alert will wake them up.

While ChatOps (using chat as a method of getting notifications and performing administration tasks) might be okay for general non-critical daytime notifications, they are not appropriate for pager alerts. Even if you have an application on your phone set to notify you about unread messages in chat, many chat applications default to a "quiet time" in the middle of the night. If you disable that, you risk being paged in the middle of the night just because someone sent you a message. Also, many third-party ChatOps systems aren't necessarily known for their mission-critical reliability and have had outages that have spanned many hours. You don't want your critical alerts to rely on an unreliable system.

Something that is fast and reliable.

Your notification system needs to be reliable and able to alert you quickly at all times. To me, this means alerting is done in-house, but many organizations opt for third parties to receive and escalate their notifications. Every additional layer you can add to your alerting is another layer of latency and another place where a notification may be dropped. Just make sure whatever method you choose is reliable and that you have some way of discovering when your monitoring system itself is offline.

In the next section, I cover how to set up escalations—meaning, how you alert other members of the team if the person on call isn't responding. Part of setting up escalations is picking a secondary, backup method of notification that relies on a different infrastructure from your primary one. So if you use your corporate Exchange server for primary notifications, you might select a personal Gmail account as a secondary. If you have a Google Apps account as your primary notification, you may pick SMS as your secondary alert.

Email servers have outages like anything else, and the goal here is to make sure that even if your primary method of notifications has an outage, you have some alternate way of finding out about it. I've had a number of occasions where my SMS secondary alert came in before my primary just due to latency with email syncing to my phone.

Create some means of alerting the whole team.

In addition to having individual alerting rules that will page someone who is on call, it's useful to have some way of paging an entire team in the event of an "all hands on deck" crisis. This may be a particular email alias or a particular key word in an email subject. However you set it up, it's important that everyone knows that this is a "pull in case of fire" notification and shouldn't be abused with non-critical messages.

______________________

Kyle Rankin is senior security and infrastructure architect, the author of many books including Linux Hardening in Hostile Networks, DevOps Troubleshooting and The Official Ubuntu Server Book, and a columnist for Linux Journal. Follow him @kylerankin