How to be a good (and lazy) System Administrator

If you're anything like the average System Administrator, you are understaffed, underfunded, and overworked. By now, you've also gotten used to the idea that no one knows you exist until the mail server goes down; then you're suddenly on America's Most Wanted. In this article, I'm assuming that you are responsible for many servers. I'm also assuming that you don't really want to work as hard as you are; if you do, you should become a Windows server manager and begin worrying about frequent patches from Microsoft, security vulnerabilities, virus protection, a clumsy user interface, and lack of native scriptability. I'm not saying that Linux is perfect, but there are a lot of things about Linux that just make it easier to administer.

As a good System Administrator, you want to get the job done right, but as a lazy System Administrator, you don't want to work too hard to get it done. In this article, I'm going to share a few simple things you can do to make your job easier.

Over the years, I've developed the mantra, “If I have to do it more than once, I write a script to do it.” For example, if I need to check the health of my servers each morning, I write a bash script to gather the information, format it for me, and mail the report to me. If I have to make a configuration change on 12 different machines, I write a script to do it. Now at first blush, you might think that it's just as easy to do the work manually as it would be to write and debug a script to do the work. But there are some hidden advantages to my approach to... um... work. Once the script is working, the task is repeatable and can either be delegated to lower-level technicians, or automated. Basically, you don't have to do it all; it just all has to get done. We'll talk about scripting a bit more in a minute.

To facilitate scripting tasks and managing multiple servers, the first thing I would do is configure certificate-based authentication (in practice, usually SSH public-key authentication) on each of my servers. It only takes a couple of minutes per server and it can really make your life easier. Since you no longer need to type a password, file transfers, backups and maintenance tasks can all be scripted. As there are plenty of simple instructions on the Internet for configuring certificate-based authentication, I'll not waste any time describing the process here.

Once we've taken the time to get the authentication working, let's start making our lives easier. What I like to do is create a shell script that exports useful variables, for example:

export MAILSERVERS="server1 server2 server3"
export WEBSERVERS="www1 www2 www3 www4"

Then I can write a simple script like this:

### Assess disk space on mail servers
source ./
for i in ${MAILSERVERS} ; do
       echo "========= ${i} ============="
       ssh root@${i} "df"
       echo "============ ============="
done

This simple script allows me to quickly assess the disk utilization on all of my mail servers. It also serves as a convenient template for other such tasks. When I want to script another task, I make a copy of this script, replace the comment at the top to describe the new script's purpose, and replace the body of the for loop.

The thing to notice is that all of my scripts will source the file so that I have a central point of configuration. When I add or remove a server, I simply change this file.
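
If you find yourself copying the template often, one possible refinement is to wrap the loop in a function. What follows is only a sketch under a few assumptions: the function name run_on_hosts is made up for illustration, the server list is defined inline instead of coming from the central configuration file, and root can log in over SSH without a password as described earlier.

```shell
### Run one command on every host in a list (illustrative sketch)

# Assumption: in a real script this list would come from the sourced
# configuration file rather than being defined here.
MAILSERVERS="server1 server2 server3"

# run_on_hosts "host1 host2 ..." command [args...]
# The name and interface are hypothetical, not part of any standard tool.
run_on_hosts() {
    hosts="$1"
    shift
    for host in ${hosts}; do
        echo "========= ${host} ============="
        # Report a failure but keep going, so one dead host does not hide the rest.
        ssh "root@${host}" "$@" || echo "FAILED: ${host}" >&2
        echo "==================================="
    done
}

# Example: run_on_hosts "${MAILSERVERS}" df
```

With something like this, a new task needs only a new last line, although copying the whole template works just as well and keeps each script readable on its own.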

Also, notice the comment at the top of the file. Once you get about 50 different scripts in a single directory, it becomes difficult to remember which script does what, unless you start naming them like, which I refuse to do. So, when I need to figure out which script does what, I type:

grep "###" *

This gives me a nice list of scripts and a brief description of what they do.
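
If you want the list a little tidier, the same idea can be wrapped in a small helper. This is a sketch only: describe_scripts is a hypothetical name, and it assumes every script keeps its description on a line that starts with "###".

```shell
### List each script in a directory with its description (illustrative sketch)

# describe_scripts [directory]
# Hypothetical helper; assumes the description is a line starting with "###".
describe_scripts() {
    dir="${1:-.}"
    # /dev/null is listed as an extra file so grep prints file names even
    # when the directory holds only one script.
    grep '^###' "${dir}"/* /dev/null | sed 's/:### */: /'
}

# Example: describe_scripts ~/bin
```

Anchoring the pattern at the start of the line keeps the helper from listing lines that merely mention "###" somewhere in the middle.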

The corollary to my mantra about scripting is that if I have to perform a given task every day, week or month, I put the job in cron and send the results to email. Many systems come with directories that hold scripts for cron to run hourly, daily and weekly. I think that's a really nice way to do things, but sometimes you have to be able to determine exactly when a given script runs. To do that, you have to modify the crontab yourself. For example, I don't want my backups running just whenever /etc/cron.daily decides to run them; I want them to start and finish outside of regular business hours. Since I have much more important things than crontab's format to remember and I'm too lazy to look it up each time, I usually insert the following line into my crontabs:

# min   hour    dom     month   dow     command

Then, each time I modify my crontab, I can quickly add the fields I want and move on. I know this isn't earth-shattering, but it's just one of those simple things you can do to save time and effort.
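
To show how the header comment lines up with real entries, here is what a crontab might look like. Everything below the comment is an assumption for illustration: the script paths and the email address are made up, and the MAILTO line relies on behavior found in the cron implementations most Linux distributions ship (Vixie cron and cronie), so check your own system's crontab(5) man page first.

```crontab
# min   hour    dom     month   dow     command
# Hypothetical address; cron mails each job's output here.
MAILTO=admin@example.com
# Hypothetical script: nightly backup at 01:30, outside business hours.
30      1       *       *       *       /usr/local/bin/nightly-backup.sh
# Hypothetical script: health report at 06:00, Monday through Friday.
0       6       *       *       1-5     /usr/local/bin/morning-report.sh
```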

Logging, in the form of syslog, is a feature that comes with Linux, but because it tends to produce such a huge volume of data, it's rarely used to its full potential. Usually, people simply configure logrotate to truncate the logs to keep them from filling up the filesystem. Only when there's a problem do those people go back and look at what their logs want to tell them. To syslog, I'd also add web logs, firewall logs, mail logs and any other logs produced by the daemons on a given server. I would never advocate reading all of these logs line by line. Instead, you should implement some kind of log analysis program, even if it's just a series of greps piped together. You will need to make regular, incremental changes to your ruleset in order to filter out as much of the noise as possible. However you do it, the reports should be emailed to you on a regular basis and you need to at least glance at them in a timely manner.
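
As a rough idea of what the simplest version could look like, the sketch below keeps the "ruleset" in a file of patterns and lets a single grep throw away every line that matches one of them. The function name filter_noise, the rules file location and the email address are all hypothetical.

```shell
### Filter known noise out of a log and report what is left (illustrative sketch)

# filter_noise RULES_FILE < logfile
# RULES_FILE holds one extended regular expression per line, each describing
# log lines that are safe to ignore. Keep the file free of blank lines: an
# empty pattern matches every line, so nothing would be reported.
filter_noise() {
    grep -v -E -f "$1"
}

# Hypothetical usage, for example from cron:
#   filter_noise /etc/logreport/ignore.rules < /var/log/messages \
#       | mail -s "log report for $(hostname)" admin@example.com
```

Tuning the report then means adding one line to the rules file each time you decide that a recurring message is noise.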

Of course, configuring log analysis on many servers seems, to me, like a lot of work. You might consider configuring all of your servers to send their logs to a single workstation. Then you only have to configure one instance of the analysis program, instead of trying to replicate the same configuration on each server. You could even use the technique outlined above to pull the log files from your servers so they can be analyzed locally.
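
Pulling the logs can reuse the same loop-over-servers pattern. The following is a minimal sketch, assuming passwordless root SSH access is already in place; pull_logs is a hypothetical name, and the log path and destination directory in the example are placeholders.

```shell
### Copy one log file from each server to a local directory (illustrative sketch)

# pull_logs "host1 host2 ..." REMOTE_PATH LOCAL_DIR
# Hypothetical helper; each host gets its own subdirectory under LOCAL_DIR
# so that files with the same name do not overwrite each other.
pull_logs() {
    hosts="$1"
    remote_path="$2"
    dest="$3"
    for host in ${hosts}; do
        mkdir -p "${dest}/${host}"
        scp -q "root@${host}:${remote_path}" "${dest}/${host}/" \
            || echo "could not fetch ${remote_path} from ${host}" >&2
    done
}

# Example: pull_logs "${WEBSERVERS}" /var/log/messages /var/tmp/pulled-logs
```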

Over the years, I've gotten some pretty exciting benefits from looking at my logs. One time, smartd informed me that one of my IDE hard drives was about to fail, before it actually failed. I was able to plan an outage and replace the drive before it failed and before I lost data. A couple of times, I've noticed authentication failures on my web servers and actually called the customer at his desk and resolved the situation. I once discovered a corrupt database index because I happened to be looking at my Apache log file and noticed that the server was taking an inordinate amount of time to serve up an application. After calling the customer at his desk to tell him I noticed the problem, I started working on the problem... before anyone had even reported it. By the time the problem was reported by others, I had it diagnosed and had an ETR so that when customers called, I didn't even bother opening a service ticket. I just told them it would be fixed in half an hour.

I'm also a big fan of server and service monitoring. I used to spend the first part of each day checking to see that all of my servers were healthy and happy. Now I simply bring up the monitoring console and look for non-green lights, and I'm usually aware of problems before my customers are. Let's face it, as soon as your manager realizes that the mail server is down, he's going to go looking for you; he might as well find you in the server room, working on the mail server.

Service monitoring really isn't that hard to set up and it's a great way to know about problems before your customers do. But you can't just set it up and assume it works. I was once in a position where the corporation told all of the departments that we all had to use the new corporate monitoring capability. Of course, this was great news to me since I'd no longer have to provide the monitoring function for my servers. Being a good (and lazy) System Administrator, I quickly converted to the corporate monitoring... and held a fire drill. I walked over to one of my servers, shut it down, and started the timer. My pager went off 30 minutes later, and that was unacceptable in the environment I was in. After I had a brief conversation with their manager, the monitoring department made a few changes to their procedures and everyone was happy. You should ALWAYS test any monitoring system you implement.

Another benefit of having a solid monitoring system is that you can gather availability and performance metrics. These reports can be taken to management to justify equipment purchases, or to disprove complaints from customers over availability. There is nothing like having hard numbers in a management meeting.

You should also try to anticipate any event that might cause an outage and try to configure your monitoring to detect that event. As a rule of thumb, the more quickly an event can happen, the more frequently you should check for it. For example, since my servers could be unplugged rather quickly, I ping them pretty frequently. On the other hand, it's unlikely that their hard drives will fill up in the next 15 minutes, so I monitor drive utilization less often. I usually set alarm thresholds fairly low. For example, if I have a disk that is usually 30% full, I might set my alarm threshold at 45%. When the drive usage passes this point, I know something is wrong, but we're not at the point where we're in danger of a failure. I might even be able to ignore it for a while as I make plans to fix it.
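
If you don't have a monitoring package yet, both checks can be approximated with a few lines of shell. This is a sketch rather than a replacement for a real monitoring system: the function names are hypothetical, the ping flags are the ones used by the Linux iputils version of ping, and for simplicity one threshold is applied to every filesystem instead of one per disk.

```shell
### Two simple checks: host reachability and disk usage (illustrative sketch)

# check_hosts host1 host2 ...
# Sends one ping per host (2 second timeout, Linux iputils flags) and prints
# OK or DOWN for each. Cheap enough to run every few minutes.
check_hosts() {
    for host in "$@"; do
        if ping -c 1 -W 2 "${host}" >/dev/null 2>&1; then
            echo "OK   ${host}"
        else
            echo "DOWN ${host}"
        fi
    done
}

# over_threshold LIMIT < output of "df -P"
# Prints one line per filesystem whose usage is above LIMIT percent.
# In "df -P" output, column 5 is the usage and column 6 the mount point.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 > limit + 0)
            print $6 " is at " use "% (limit " limit "%)"
    }'
}

# Hypothetical usage:
#   check_hosts ${MAILSERVERS}                        # run often, e.g. every 5 minutes
#   ssh root@server1 "df -P" | over_threshold 45      # run less often, e.g. hourly
```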

None of what I've described here is difficult and none of it will require any real engineering work. You'll have to put some thought into your monitoring system, but even if all you did was ping your servers and test that the applications, web, mail, etc., are responding, you will gain significant advantages in a short amount of time. And none of this has to be done all at once. Just take a few minutes each time you log into a server and do a few simple things here and there. Eventually, you'll end up with a few minutes each day to catch your breath.
