Further Tales of Horror
First, I should fill in the background. I'm not, by training, a system administrator. However, in the past I have held a position under that title for a SUN-based network running Solaris, and for the past two years I have been an NT administrator. Despite having trained in Computer Science and spent most of my working life doing software engineering with UNIX, it is one of life's little ironies that I now find myself a system administrator for a network of PCs running Windows and a couple of NT servers. In stark contrast to system administration using UNIX, the same task for Windows NT is a nail-biting, stress-filled rollercoaster ride of reboots, re-installs and frenzied web searches for answers to inexplicable problems.
The task at hand sounded simple enough: we needed to upgrade our antiquated MS Mail Server to Microsoft Exchange (MSE), with which I had some nasty experiences in the past. MSE is difficult to configure and one of the most unnecessarily complicated pieces of software I have encountered, and I try to avoid it wherever possible. Despite my protestations, the powers that be had decreed MSE was the way to go, so away we went.
Now, the current configuration was clearly not adequate, but it was stable after a fashion. The server was configured to use a periodic dial-up network connection to the main company server in Geneva from our site in Bangkok. The connection was done using MS Mail's own dial-out scheduler, which did not pay any attention to the dial-up networking (DUN) or remote access service (RAS) or anything else now "standard" on Microsoft servers. Ironically, this made MS Mail quite robust, since it was not affected by modem driver problems, routing problems and so on. It just connected directly to the port and ran its own script.
With my copy of Windows NT 4.0 Unleashed in hand, I set out to do the reconfiguration. First, I needed to configure dial-up networking to connect to the Geneva server. This was complicated by the fact that the connection was not to be made over the Internet via a local ISP, but via the proprietary SITA network. Needless to say, getting the configuration working properly and logging into the Geneva site was a tale in itself, but I'll leave that out since it's not really about Windows NT. After getting the connection up, Geneva reported I had connected okay, and they could see our network. In addition, their administrators could connect to our running MSE server and view the configuration. Curiously enough, however, they were unable to make any configuration changes, despite having the appropriate account privileges. The same applied to making changes to the dreaded Windows NT registry.
After playing with the system for about eight hours, I stumbled across the possible problem, written as one sentence in the Unleashed book and relating to a service called Routing and Remote Access (RRAS)--a kind of RAS++. The book said, "[when connecting with RRAS] of course you need another RRAS connection at the other end".
Of course! What was I thinking? Despite the fact that RAS is used for dial-up to ISPs using standard protocols, when it talks to another RAS service it will need to do something proprietary. This is vintage Microsoft--take a standard and add something extra that breaks it. In addition, it will be incompatible with RRAS, or so I'm told.
Our administrators in Geneva agreed the problem lay with the fact that they were running RRAS at their end and we were running RAS at our end. The documentation had led me to believe RRAS was just RAS with demand-dial routing and other bells and whistles. I was assured, however, there were great mysteries in the world of RRAS to which I had not been initiated, and it was just better to upgrade. I noted that, despite the extra features, RRAS doesn't appear to offer anything, such as network address translation, that would make it really useful as a demand-dial router connecting a network to an ISP. Then the fun began.
So I begin the process of getting RRAS up and running. First, I have to remove the existing RAS service and associated bits. Naturally, I am asked if I would like to reboot. Then I install and configure the RRAS service--another reboot. Since RRAS uses a different kind of connection from RAS, I need to reconfigure the connection to Geneva. Then I try to connect. Blue screen of death. I am completely unsurprised by this, of course, since it is a standard occurrence when making any major changes to an NT server. Unfazed, I review the situation and remember the server also has a demand-dial Internet connection, handled by a very nice piece of software called Sygate. It seems a reasonable assumption that Sygate is somehow interfering with RRAS. I uninstall it and, of course, reboot. I try the connection, but the blue screen of death bites again. I check for interrupt/driver conflicts, but everything seems fine. I check the requirements for installation in the RRAS documentation again, and everything seems fine. Then I recall a snippet of information from a conversation a couple of weeks previously: you must re-install the service pack after installing RRAS, even though this is not in the Microsoft-supplied documentation. I re-install the service pack and reboot. I try the connection again, and this time I get the dreaded error 720: No PPP protocols configured. This is quite a well-known error message for NT administrators that usually occurs precisely when you do have PPP protocols configured. I follow the troubleshooting documentation that I have fortunately downloaded and printed out ahead of time, and it recommends three different solutions. To cut a long story short, I try all three and--you guessed it--blue screen of death.
Next, I recall the server has two modems, an external modem for the mail connection and an internal WinModem for the Internet service. WinModems are usually dumb modem chips with some extras, and most of the work is done by the CPU, using the modem driver. This is an obvious possibility for trouble, so I uninstall the modem driver, which then necessitates reconfiguring RRAS, which then means, of course, another reboot. I try the connection again--blue screen of death. I uninstall the other modem's driver and use the generic driver. Blue screen of death. I then decide that I can't figure the problem out by myself; I'll just have to jump on the Internet and...D'oh! I've trashed my Internet connection, so that's no good. Out comes the laptop, and I go to Dogpile and do a search. Why not go to Microsoft support, I hear you say? Well, that's one possibility, and sometimes the answers are there. However, for the most part, the answers are buried in large amounts of guff that serves little or no purpose. The documentation is written from the point of view of someone trying to cover their tail, rather than someone actually trying to help solve a problem. In my experience, the Microsoft web site does not document anything that makes them look bad, no matter how useful such information might be for solving problems. Better to look elsewhere.
Sure enough, one of the discussion groups mentions a similar problem. I look for similarities and find service pack 5 was involved, the same as our server. I decide to upgrade to service pack 6. Of course, those in the know are aware that service pack 6 is unstable, and NT now requires service pack 6a (as opposed to service pack 7?). Anyway, I don't have service pack 6a, and I don't have the time to download the single 40M file from the web site to my laptop over my 40K dial-up connection. What I do have is service pack 6 plus the hotfix that turns it into service pack 6a. I install service pack 6, reboot, install the hotfix, reboot again. I start the RRAS connection and get the blue screen of death. Meanwhile, a message alert has appeared telling me that one or more services have failed to start.
Examining the logs, I find the backup software is complaining about something. Since the service pack upgrade has availed me nothing, I'll just uninstall it and be back where I was before, right? Wrong. I uninstall the service pack and reboot, only to find another alert. This time it isn't the backup software, which is now very happy, but a service called the protected services manager, and it reports a corruption. Fortunately, I've encountered this before, and it's very nasty. The Microsoft documentation says that once this happens your system is basically trashed, and you need to re-install the OS from scratch. However, the last time, this was due to re-assigning a drive letter on a disk partition. Apparently, the drive letter was part of a whole set of paths in the registry, and when the drive letter was re-assigned, the OS thoughtfully left the registry entries as they were. I was able to search the registry and replace all occurrences of these paths. This time, the problem turns out to be more complicated, and after a bit of tinkering, I am forced to make the inevitable decision: re-install Windows NT from scratch. This has happened before, however, and I am prepared with an unattended installation script, which I use to re-install the OS. Meanwhile, I head off to grab a coffee and try to placate my users, who are all complaining they can't read e-mail, print or surf the Internet.
Eventually, I finish the reinstall, rebooting three times. I then install service pack 6, reboot, the hotfix, reboot, RRAS, reboot, service pack 6 again, reboot, the hotfix, reboot. I use the same modem drivers, but I don't install anything else like Sygate. This time, when I configure the connection it comes up without a complaint. I never did figure out what the problem was. It could have been the modem drivers, Sygate, service pack 5, something else in the original installation or all of the above.
In the end, I have spent days resolving the problem and the associated network outage, and this only gets me to the stage where I am able to address the MSE configuration, a whole new can of worms. Let's review for a moment: I have spent half a week trying to get the server to connect to another server. That's it. Nothing special, just a normal TCP/IP connection with the Microsoft add-ons. This is not an unusual story. While this is probably difficult to relate to for most of you who are smart enough to avoid Windows NT and use something like Linux, it is a completely typical experience for an NT administrator. It is normal to have a server crash and/or reboot whenever you make a configuration change or install a new service. It is normal that error messages from such events are obscure and uninformative, and usually poorly documented or not documented at all. It is normal that changing a driver or a service can lead to the entire system collapsing like a house of cards. This is why NT administrators are so reluctant to make any kind of change to a stable system.
At home, of course, I run Linux and have done so since 1993. Wherever I can, I try to avoid Windows in the workplace, but this is becoming increasingly difficult. Managers in large companies, confused by the exponentially increasing range of technical issues and confounded by the Microsoft PR machine, are taking the easy road and just opting for Microsoft. Despite the increasing popularity of Linux, Microsoft is still the number one desktop solution and has a large and increasing share of the server market. I believe two reasons for this are the lack of awareness of what is involved in configuring and maintaining a Windows NT server and lowered expectations for what a server can provide. For those of us in the UNIX world, having to reboot your machine every time you reconfigure an interface or install a service is absurd. Having a server inexplicably crash without warning and leave little or no information to diagnose and remedy the problem is absurd. Having such a fragile balance of software interdependencies that altering any one of them can lead to a system corruption so bad it requires re-installing the operating system is absurd.
In contrast, a well-configured Linux server is able to stand almost any configuration change, except hardware changes or kernel upgrades, without service interruption. Services can be installed, started and uninstalled without affecting operations. Similarly, interfaces can be brought up and down and reconfigured. Modules can be loaded and unloaded from the kernel. High-end UNIX servers, such as the HP9000 series, can even hot-swap motherboard components like memory and interface cards. Multiple CPU servers can tolerate CPU replacement without interruption. This is what a server is about, uptimes measured in years not days. Robust configurations you can back out of if need be. Security, stability and control.
This is not the experience you get when you use Windows NT Server. However, with Windows 2000 now out in the marketplace, the current assertion is that all of the problems we encountered with Windows NT have now been solved. Didn't we hear this when NT 4.0 was first released? Needless to say, with every problem Windows 2000 has solved, it introduces another, along with a plethora of arcane and unnecessarily complicated new concepts that have already spawned a score of books that attempt to explain them. All this is supposedly to provide us with the ultimate server solution. I have a better solution--switch to Linux.
Brian Lowe did his honours degree in computational linguistics at the University of Western Australia and acquired a PhD in database systems at RMIT University. He has worked as a software engineer for about six years, and he continues his research in the Multimedia Database Systems group at RMIT University. He currently lives and works in Bangkok (http://www.mds.rmit.edu.au/~lowe).