From the Publisher

The end of March is quickly approaching as I write. Here at SSC, March is always an exciting time as it is the end of our accounting year. For the last two weeks Gena Shurtleff and I (with help from others) have been working on a budget for the next year of LJ--something that has made some of us very grouchy. [mainly, our publisher --Ed.]

To make this process even more fun, there have been a lot of computer-related problems—all related to our Linux systems. [Note: if you are humor-impaired you may want to skip a lot of this.] First, our main server failed. Linux apparently caused a pin to break off the cable to the external SCSI disk drive. Then, a week or so later, Linux broke a head on the disk drive in our firewall. Next, the fan in our editor's computer started growling at her. On top of all that, various systems in the office were mysteriously crashing or exhibiting very strange behavior in general. The end result was too many hours of downtime, lost e-mail and an unhappy working relationship with our computers.

We began to wonder if Linux was good enough for us, and if perhaps Windows NT might support “automatic pin re-soldering” and “disk-head replacement”.

Once things calmed down we took a closer look at the problems. By the way, “we” refers specifically to Jay Painter (our new systems administrator), Peter Struijk and me. First, we concluded that it probably wasn't the fault of Linux that hardware was breaking. As scary as it is that multitasking operating systems write to disks whenever they think it's a good time, this really isn't a reason for a cable to break.

To address the issue of being down for longer periods of time than we thought appropriate, let's look at some specific cases.

Note: at this time, our various systems ran a host of different software versions (kernels and C libraries), and we were in the midst of converting them to Debian in order to have consistency across all machines.

The Server Failure

The failure of the cable caused the data on the server disks to be scrambled. Fortunately, we had back-up copies of the files, so it seemed like a good time to do an upgrade. We made the logical decision to reload a standard Slackware system (rather than try to change to Debian), restore the user files and be on our way. It turned out to be not so easy. The new load of Slackware had more differences from the old version than we expected. Libraries were different. NFS and NIS were different. Adobe fonts we use for doing reference cards had to be reloaded. Configuration files for groff had to be updated. A lot of work was done to get the new configuration talking to all the old systems and to get everything tuned.

Firewall Failure

Possibly inspired by the extra work the firewall was doing while the server was down, a head died on the disk in the firewall the next weekend, resulting in the loss of a lot of mail. Why was it lost? Why wasn't it queued and then forwarded? Was this another Linux shortcoming?

On investigation we found that backup MX (mail exchanger) records were in place to take care of this very problem; however, they were pointing to the wrong machine. Again, the problem could not be pinned on Linux; it was an administrative error by a previous systems administrator. The mistake went undetected because this backup had never been exercised before, since Linux had been working flawlessly.
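A check along the following lines might have caught the misdirected backup before it was ever needed. This is only a minimal sketch: the host names are hypothetical, and the list of machines that actually accept mail for the domain is assumed to be maintained by hand. It does no DNS lookups itself; it simply compares the MX records you give it against that list.

```python
def misdirected_backups(mx_records, accepting_hosts):
    """Return backup MX hosts that are not known to accept mail for the domain.

    mx_records: list of (preference, hostname) pairs; the lowest
                preference value is treated as the primary exchanger.
    accepting_hosts: hostnames known to queue or deliver mail for the
                domain (an assumption: this list is maintained by hand).
    """
    # Normalize names so a trailing dot or case difference does not matter.
    accepting = {h.lower().rstrip(".") for h in accepting_hosts}
    ordered = sorted(mx_records)                  # primary first
    backups = [host for _, host in ordered[1:]]   # everything after the primary
    return [h for h in backups if h.lower().rstrip(".") not in accepting]


# Hypothetical records; none of these names come from the article.
records = [(10, "firewall.example.com."), (20, "oldbox.example.com.")]
queueing = ["firewall.example.com", "backup.example.com"]
print(misdirected_backups(records, queueing))
```

Here the backup record points at a machine that is not on the list of hosts prepared to queue mail, so it is reported. Running a comparison like this periodically exercises the backup configuration on paper, even when the primary never fails.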

Strange Software Problems

Let's move on to those strange software problems I mentioned. Surely we can find something to pin on Linux here.

One machine, used as our DNS name server, had been less than reliable. Two things in particular happened quite regularly. The first was that syslogd, the system log daemon, would hang in a loop eating up all available CPU time. While this problem appeared to be related to the location of the log file disappearing (caused by a reboot of the file server or a network problem), we haven't been able to fix it. However, it doesn't appear to happen in newer Linux releases (our problem machine is running 1.2.13) and, while it is irritating, it does not cause the machine to crash—just to run slower than normal.

The other problem on this same machine was stranger, although it turned out to be fixable. Multiple copies of crond (the cron daemon) kept appearing on the machine even though only one was initiated at boot time. One day, I found 13 crond jobs running, killed 12 and, a few hours later, found three still running.

At this point Jay jokingly said, “Maybe there is a cron job starting cron jobs.” Well, since there were processes being started by cron that didn't exist on the other machines, I started looking around for suspicious jobs. The first couple of extraneous jobs I found were benign, but then I found one that made both of us realize that Jay's attempt at a joke wasn't a joke at all. There was, in fact, a cron job that initiated another cron job. Or, more accurately, a cron job that grepped for everything but a cron job, attempted to kill all cron jobs and then started a new cron job. In other words, it looked like a partially written script to do “who knows what”--nothing that would actually work. It was signed and dated, so we could see both who wrote it and that the creation date was about the time when stability problems first appeared. Again, we had found “pilot error”, not a problem with Linux.
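Counting daemons by hand is how we stumbled on this one, but the same check is easy to automate. The sketch below is illustrative only: it assumes a process listing with the PID in the first column and the command name in the second, which may not match what your ps prints, and the sample listing is made up.

```python
def count_daemon(ps_lines, name="crond"):
    """Count processes whose command name is exactly `name`.

    ps_lines: lines of the form "PID COMMAND [ARGS...]" (assumed format).
    """
    count = 0
    for line in ps_lines:
        fields = line.split()
        # Skip blank lines and the header, whose first field is not a number.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        # Compare only the base name, so /usr/sbin/crond matches crond.
        command = fields[1].rsplit("/", 1)[-1]
        if command == name:
            count += 1
    return count


# Made-up listing for illustration.
sample = [
    "  PID COMMAND",
    "   41 /usr/sbin/crond",
    "   97 syslogd",
    "  212 crond",
    "  305 grep crond",
]
running = count_daemon(sample)
if running > 1:
    print(f"warning: {running} copies of crond are running")
```

Matching on the command name alone, rather than anywhere in the line, keeps a stray "grep crond" from being counted as a daemon.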

There are more of these stories than there is space to tell them. Basically, what we found was that even though various distributions may have some kinks in them, such as a wrong file permission at install time, they all do install. That's true for Caldera, Linux FT, Debian, Red Hat, Slackware, Yggdrasil and all the others. Software does not wear out. If the system is running, it is not likely to stop, aside from hardware errors. As a case in point, I still have a 0.99 kernel running on the main machine at that was installed in August 1993. It is an NFS server with three 38.8KB modems on it. The hardware is a 386DX40 with 8MB of RAM. Why haven't I upgraded it? It works, and it is extremely stable. The last reboot was in November 1996, when I turned off the machine to remove a Zip drive from the SCSI bus.


Phil Hughes