How to be a good (and lazy) System Administrator

If you're anything like the average System Administrator, you are understaffed, underfunded, and overworked. By now, you've also gotten used to the idea that no one knows you exist until the mail server goes down; then you're suddenly on America's Most Wanted. I'm going to assume that you're responsible for many servers, and that you don't really want to work as hard as you are. If you do, you should become a Windows server manager and begin worrying about frequent patches from Microsoft, security vulnerabilities, virus protection, a clumsy user interface, and a lack of native scriptability. I'm not saying that Linux is perfect, but there are a lot of things about Linux that just make it easier to administer.

As a good System Administrator, you want to get the job done right, but as a lazy System Administrator, you don't want to work too hard to get it done. In this article, I'm going to share a few simple things you can do to make your job easier.

Over the years, I've developed the mantra, “If I have to do it more than once, I write a script to do it.” For example, if I need to check the health of my servers each morning, I'd write a bash script to gather the information, format it for me, and mail the report to me. If I had to make a configuration change on 12 different machines, I'd write a script to do it. Now, at first blush, you might think that it's just as easy to do the work manually as it would be to write and debug a script to do it. But there are some hidden advantages to my approach to... um... work. Once the script is working, the task is repeatable and can be delegated to lower-level technicians or automated outright. Basically, you don't have to do it all; it just all has to get done. We'll talk about scripting a bit more in a minute.

To facilitate scripting tasks and managing multiple servers, the first thing I would do is configure certificate-based (SSH public-key) authentication on each of my servers. It takes only a couple of minutes per server, and it can really make your life easier. Since you no longer need to type a password, file transfers, backups and maintenance tasks can all be scripted. There are plenty of detailed walk-throughs on the Internet, so I'll give only the barest sketch here.
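
A minimal version, assuming OpenSSH's ssh-keygen and ssh-copy-id and a hypothetical host named server1, goes like this:

ssh-keygen -t rsa                # generate a key pair; use an empty passphrase if scripts must run unattended
ssh-copy-id root@server1         # append your public key to server1's authorized_keys
ssh root@server1 uptime          # should now run without prompting for a password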

Once we've taken the time to get the authentication working, let's start making our lives easier. What I like to do is create a shell script that exports useful variables, for example:

# servers.sh -- shared server lists; source this file from every script
export MAILSERVERS="server1 server2 server3"
export WEBSERVERS="www1 www2 www3 www4"

Then I can write a simple script like this:

#!/bin/bash
### Assess disk space on mail servers
source ./servers.sh
for i in ${MAILSERVERS} ; do
        echo "========== ${i} =========="
        ssh root@${i} "df"
        echo "==========================="
done

This simple script allows me to quickly assess disk utilization on all of my mail servers. It also serves as a convenient template for other such tasks. When I want to script another task, I make a copy of this script, change the comment at the top to describe the new script's purpose, and replace the body of the for loop.

The thing to notice is that all of my scripts will source the servers.sh file so that I have a central point of configuration. When I add or remove a server, I simply change this file.
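
To give one more example of the pattern, here's a hypothetical script that pushes a configuration change to all of the web servers; the file being pushed and the restart command are made-up examples, so substitute your own:

#!/bin/bash
### Push an updated ntp.conf to all web servers and restart ntpd
source ./servers.sh
for i in ${WEBSERVERS} ; do
        echo "========== ${i} =========="
        scp /etc/ntp.conf root@${i}:/etc/ntp.conf
        ssh root@${i} "/etc/init.d/ntpd restart"
done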

Also, notice the comment at the top of the file. Once you get about 50 different scripts in a single directory, it becomes difficult to remember which script does what, unless you start naming them like assess_the_disk_space_on_mail_servers.sh, which I refuse to do. So, when I need to figure out which script does what, I type:

grep "###" *

This gives me a nice list of scripts and a brief description of what they do.
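
With hypothetical script names, the output looks something like this:

check_mail_disk.sh:### Assess disk space on mail servers
push_ntp_conf.sh:### Push an updated ntp.conf to all web servers and restart ntpd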

The corollary to my mantra about scripting is that if I have to perform a given task every day, week or month, I put the job in cron and have the results emailed to me. Many systems come with directories that hold scripts for cron to run hourly, daily and weekly. I think that's a really nice way to do things, but sometimes you need to control exactly when a given script runs. To do that, you have to modify the crontab yourself. For example, I don't want my backups running whenever /etc/cron.daily decides to run them; I want them to start and finish outside of regular business hours. Since I have more important things than crontab's field order to remember, and I'm too lazy to look it up each time, I usually insert the following line into my crontabs:

# min   hour    dom     month   dow     command

Then, each time I modify my crontab, I can quickly fill in the fields I want and move on. I know this isn't Earth-shattering, but it's just one of those simple things you can do to save time and effort.
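
For example, a backup job that runs at 2:30 AM, with its output mailed to me, would look something like this (the script path and address are placeholders):

# min   hour    dom     month   dow     command
30      2       *       *       *       /usr/local/bin/backup.sh 2>&1 | mail -s "Nightly backup" admin@example.com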

Logging, in the form of syslog, is a feature that comes with Linux, but because it tends to produce such a huge volume of data, it's rarely used to its full potential. Usually, people simply configure logrotate to truncate the logs and keep them from filling up the filesystem. Only when there's a problem do those people go back and look at what their logs have to tell them. To syslog, I'd also add web logs, firewall logs, mail logs and any other logs produced by the daemons on a given server. I would never advocate reading all of these logs line by line. Instead, you should implement some kind of log analysis program, even if it's just a series of greps piped together. You'll need to make regular, incremental changes to your ruleset in order to filter out as much of the noise as possible. However you do it, the reports should be emailed to you on a regular basis, and you need to at least glance at them in a timely manner.
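
As a sketch of the greps-piped-together approach, something like the following works; the noise patterns shown are only examples, and you'd replace them with whatever clutters your own logs:

#!/bin/bash
### Mail me the syslog with the known noise stripped out
grep -v -e 'CRON' -e 'pam_unix' /var/log/messages \
    | grep -v 'connect from localhost' \
    | mail -s "Log report for $(hostname)" admin@example.com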

Of course, configuring log analysis on many servers seems, to me, like a lot of work. You might consider configuring all of your servers to send their logs to a single workstation. Then you only have to configure one instance of the analysis program, instead of trying to replicate the same configuration on each server. You could even use the technique outlined above to pull the log files from your servers so they can be analyzed locally.
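
With the classic sysklogd, for instance, centralizing the logs is a two-line change; assuming a central host named loghost:

# On each client, add this line to /etc/syslog.conf:
*.*     @loghost

# On loghost itself, start syslogd with remote reception enabled:
syslogd -r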

Over the years, I've gotten some pretty exciting benefits from looking at my logs. One time, smartd warned me that one of my IDE hard drives was about to fail, before it actually did. I was able to plan an outage and replace the drive before it died and before I lost any data. A couple of times, I've noticed authentication failures on my web servers, called the customer at his desk and resolved the situation. I once discovered a corrupt database index because I happened to be looking at my Apache log file and noticed that the server was taking an inordinate amount of time to serve up an application. After calling the customer at his desk to tell him I'd noticed the problem, I started working on it... before anyone had even reported it. By the time the problem was reported by others, I had it diagnosed and had an ETR (estimated time to repair), so when customers called, I didn't even bother opening a service ticket. I just told them it would be fixed in half an hour.

I'm also a big fan of server and service monitoring. I used to spend the first part of each day checking to see that all of my servers were healthy and happy. Now I simply bring up the monitoring console and look for non-green lights, and I'm usually aware of problems before my customers are. Let's face it, as soon as your manager realizes that the mail server is down, he's going to go looking for you; he might as well find you in the server room, working on the mail server.

Service monitoring really isn't that hard to set up, and it's a great way to know about problems before your customers do. But you can't just set it up and assume it works. I was once in a position where the corporation told all of the departments that they had to use the new corporate monitoring service. Of course, this was great news to me, since I'd no longer have to provide the monitoring function for my servers. Being a good, and lazy, System Administrator, I quickly converted to the corporate monitoring... and held a fire drill. I walked over to one of my servers, shut it down, and started the timer. My pager went off 30 minutes later, and that was unacceptable in the environment I was in. After I had a brief conversation with their manager, the monitoring department made a few changes to their procedures, and everyone was happy. You should ALWAYS test any monitoring system you implement.

Another benefit of having a solid monitoring system is that you can gather availability and performance metrics. These reports can be taken to management to justify equipment purchases or to rebut customer complaints about availability. There is nothing like having hard numbers in a management meeting.

You should also try to anticipate any event that might cause an outage and configure your monitoring to detect it. As a rule of thumb, events that can happen quickly should be monitored more frequently. For example, since my servers could be unplugged rather quickly, I ping them pretty often. On the other hand, it's unlikely that their hard drives will fill up in the next 15 minutes, so I monitor drive utilization less often. I usually set alarm thresholds fairly low. For example, if I have a disk that is usually 30% full, I might set my alarm threshold at 45%. When the drive usage passes that point, I know something is wrong, but we're not yet in danger of a failure. I might even be able to ignore it for a while, while I make plans to fix it.
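
If your monitoring package can't be bent to a particular check, the same idea is easy to script yourself. Here's a sketch of a disk-threshold alarm; the threshold, mount point and address are placeholders:

#!/bin/bash
### Alarm if /var climbs past its usual utilization
THRESHOLD=45
USED=$(df /var | awk 'NR==2 {sub(/%/, "", $5); print $5}')
if [ "${USED}" -gt "${THRESHOLD}" ] ; then
        echo "/var is ${USED}% full on $(hostname)" \
            | mail -s "Disk alarm: $(hostname)" admin@example.com
fi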

None of what I've described here is difficult and none of it will require any real engineering work. You'll have to put some thought into your monitoring system, but even if all you did was ping your servers and test that the applications, web, mail, etc., are responding, you will gain significant advantages in a short amount of time. And none of this has to be done all at once. Just take a few minutes each time you log into a server and do a few simple things here and there. Eventually, you'll end up with a few minutes each day to catch your breath.
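
Even the bare-minimum version, just pinging everything in servers.sh, is only a few lines:

#!/bin/bash
### Complain about any server that doesn't answer a ping
source ./servers.sh
for i in ${MAILSERVERS} ${WEBSERVERS} ; do
        if ! ping -c 1 -w 5 ${i} > /dev/null 2>&1 ; then
                echo "${i} is not answering pings"
        fi
done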

______________________

Mike Diehl is a freelance Computer Nerd specializing in Linux administration, programming, and VoIP. Mike lives in Albuquerque, NM, with his wife and three sons. He can be reached at mdiehl@diehlnet.com.

Comments


ssh root@${i} "df"?

Posted by Anonymous

Am I the only one who has an issue with running non-privileged commands like df as root? And with allowing root access through certificate-based authentication in general?

puppet is where it's at.

Posted by miker

Good article. What you describe is basically creating your own version of Puppet. You write little scripts here and there to do/automate different tasks. The problem with this is that you're usually the only one who will ever see these scripts. Besides making it easy to miss bugs, you also lose the benefit of having peers, and other people much smarter than you, review your scripts. There may be a far easier way to do foo that you never thought of.

This is where Puppet from Reductive Labs comes in: http://reductivelabs.com/projects/puppet/ The first paragraph at the Puppet site reads:

Puppet lets you centrally manage every important aspect of your system using a cross-platform specification language that manages all the separate elements normally aggregated in different files, like users, cron jobs, and hosts, along with obviously discrete elements like packages, services, and files.

Basically, you have a "puppet master" that describes your network. Each box on the network becomes a puppet. A puppet contacts the puppetmaster at regular intervals to see what it should be doing. Stats can be gathered. Config changes can be done once and deployed to every puppet that needs it.

I described just a tiny bit of what Puppet is all about. The goal of the project is to take all those little scripts and ideas you talk about and turn them into recipes that everyone else can use/abuse and improve. Puppet is written in Ruby, and the syntax for configuration and recipes is a breeze to learn.

Automation is the way to go

Posted by Anonymous

Automation is the way to go even if you have only a couple of servers. I recommend cfengine, which is an excellent tool for managing system configuration. It will save you a lot of time that you can use to gain more knowledge about the system.

documentation

Posted by Anonymous

I am a huge fan of automation and I agree with most everything you have written here.

In a smaller "few server" environment, cron works just fine. I just don't think it scales well to a larger installation of servers, and I wholeheartedly recommend an enterprise scheduling solution.

Another thing to point out, and one that is often overlooked, is supporting documentation. You don't necessarily need docs for every little cron task and system-monitoring script you write. But you DO need docs that outline the environment you've built for your automation.

Just because you're a one-man shop today doesn't mean that will be the case tomorrow. Documenting what you have in place will help a new admin come up to speed quickly.

Nice article...

Ever heard of Nagios?

Posted by Archangel

Nagios, once configured correctly, can do everything you mentioned in this great article and then some. Try it out if you haven't already done so. I think you may like it a lot.
You can even write Nagios modules to do custom stuff; these are essentially scripts that perform more pinpointed checks as you see fit. There are several modules already written for Nagios that you might find useful if you're not satisfied with the already great out-of-the-box product. There are enough Nagios modules out there to cover pretty much every part of a SysAdmin's requirements.
With products like Nagios and others, I see the need for SysAdmins becoming less and less.

Great article though.

