From the Publisher

To make this process even more fun, there have been a lot of computer-related problems—all related to our Linux systems.
What We Are Doing Differently

We are now back up and running and continuing our conversion to Debian. We are fixing problems as we find them—not just hardware and software but also administrative. For example, if we had been able to find the cellular phone number of our ISP, we could have had him address some problems much more quickly than by communicating via e-mail from a system that was mostly broken.

Before I get into specifics, let me say, no matter what software you are running, it is hard to program around hardware malfunctions. Detecting and halting them is about the best you can do.

You can also do backups and document. At some point we had a configuration disk for our firewall; but when we needed to replace the hard disk, the configuration disk had vanished. This loss cost hours of work time and probably a day of uptime. Having a complete backup of everything, boot disks for all machines, spare cables and disk drives and other assorted parts can make a big difference in the elapsed time to deal with a problem.

Another essential element that we are missing is a monitoring system. If something breaks at 3AM on Sunday, it may not be noticed until 8AM Monday. Being able to detect a problem immediately is crucial to getting it fixed in a timely manner. Even if it is just one person's workstation, having it fixed or replaced before the user comes to work on Monday makes for much higher effective reliability for the network.

One other innovation that we are working on is a backup file server to perform daily backups. Scripts will copy everyone's home directories over to this machine and then its disks will be written to tape. Then, if the main server fails, we can just move this machine in to take over as main.

What's Missing?

One other thing that still seems to be a problem is a reliable Auto Mount Daemon. While we have run AMD semi-successfully, it has proved to be somewhat unstable in our configuration. AMD depends on NIS, and there have been failures caused by the physical network or the yp server.

To address this problem, Jay had an idea. Note that it is still an idea not a design specification, but it is worth mentioning. Here it is in his words:

Anyway, my idea for a new VFS-based automount would be to have a daemon which mounted a special VFS file system similar to PROC and then, through a series of system calls, populated it according to its automount maps; NIS or file or otherwise. Whenever someone descended into a directory, the kernel would send a standard Unix signal to the daemon, which would execute a system call to figure out what it had to do. This would prevent the daemon from having to use shared memory or other system V IPC, and it would allow the daemon to be ported to Berkeley OSes.

For those not familiar with it, VFS is an interface layer between the user and the actual file system being supported. This made implementation of the DOS, System V and other file systems much easier than on systems without VFS. Implementing an auto-mount file system should be relatively easy using VFS, and it could offer an integrated, reliable solution to a problem that plagues many operating systems.

One other thing that would be nice is a journaling file system (JFS). Unix System V, Release 4 has one and it is probably the one place where SVR4 is superior to Linux. As this is an editorial rather than a technical article, let's just say that JFS maintains enough information to reconstruct the exact file system structure after a crash without having to revert to the seek out and destroy sort of approach that fsck uses. This means a lower probability of losing anything and much quicker reboot times after a crash.

That's it from this end. If there are people out there interested in working on either of these last two projects, drop me a line at phil@ssc.com. Maybe we can be better than everyone after all.

______________________

Phil Hughes

Webinar
One Click, Universal Protection: Implementing Centralized Security Policies on Linux Systems

As Linux continues to play an ever increasing role in corporate data centers and institutions, ensuring the integrity and protection of these systems must be a priority. With 60% of the world's websites and an increasing share of organization's mission-critical workloads running on Linux, failing to stop malware and other advanced threats on Linux can increasingly impact an organization's reputation and bottom line.

Learn More

Sponsored by Bit9

Webinar
Linux Backup and Recovery Webinar

Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.

In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.

Learn More

Sponsored by Storix