Taming the Beast

The right plan can determine the difference between a large-scale system administration nightmare and a good night's sleep for you and your sysadmin team.
Configuration Management

In small environments, you can maintain Linux systems successfully without a configuration management tool. This is not the case in large environments. If you plan on running a large number of Linux systems efficiently, I strongly encourage you to consider a configuration management system. There are currently two heavyweights in this area, Cfengine and Puppet. Cfengine is a mature product that has been around for years, and it works well. The new kid on the block is Puppet, a Ruby-based tool that is quickly gaining popularity. Your configuration management tools should, obviously, allow you to add or modify system or application configuration files to a single system or groups of machines. Some examples of files you might want to manage are /etc/fstab, ntpd.conf, httpd.conf or /etc/password. Your tool also should be able to manage symlinks and software packages or any other node attributes that change frequently.

Regardless of which configuration management tool you use, it's important to implement it early. Managing Linux configurations is something that should be set up as the node is being installed. Retrofitting configuration management on a node that is already in production can be a dangerous endeavor. Imagine pushing out an incorrect fstab or password file, and you get an idea of what can go wrong. Despite the obvious hazards of fat-fingering a configuration management tool, the benefits far outweigh the dangers. Configuration management tools provide a highly effective way of managing Linux systems and can reduce system administration overhead dramatically.

As an added bonus, configuration management systems also can be used as a system backup mechanism of sorts. Granted, you don't want to store large amounts of data in a tool like Cfengine, but in the event of system failure, using a configuration managment tool in conjunction with your node installation tools should allow you to get the system into a known good state in a minimal amount of time.

Provisioning

Provisioning is the process of installing the operating system on a machine and performing basic system configuration. At home, you probably boot your computer from a DVD to install the latest version of your favorite Linux distro. Can you imagine popping a DVD in and out of a data center full of systems? Not appealing. A more efficient approach is to install the OS over the network, and you typically do this with with a combination of PXE and Kickstart. There are numerous tools to assist with large-scale provisioning—Cobbler and Spacewalk are two—but you may prefer to roll your own. Your provisioning tools should be tightly coupled to your configuration management system. The ultimate goal is to be able to sit at your desk, run a couple commands, and see a hundred systems appear on the network a few minutes later, fully configured and ready for production.

Hardware

When it's time to purchase hardware for your new Linux super cluster, there are many things to consider, especially when it comes to choosing a good vendor. When selecting vendors, be sure to understand their support offerings fully. Will they come on-site to troubleshoot issues, or do they expect you to sit for hours on the phone pulling your hair out while they plod through an endless series of troubleshooting scripts? In my experience, the best, most responsive shops have been local whitebox vendors. It doesn't matter which route you go, large corporate or whitebox vendor, but it's important to form a solid business relationship, because you're going to be interacting with each other on a regular basis.

The odds are that old hardware is more likely to fail than newer hardware. In my shop, we typically purchase systems with three-year support contracts and then retire the machines in year four. Sometimes we keep machines around longer and simply discard a system if it experiences any type of failure. This is particularly true in tight budget years.

Purchasing the latest, greatest hardware is always tempting, but I suggest buying widely adopted, field-tested systems. Common hardware usually means better Linux community support. When your network card starts flaking out, you're more likely to find a solution to the problem if 100,000 other Linux users also have the same NIC. In recent years, I've been very happy with the Linux compatibility and affordability of Supermicro systems. If your budget allows, consider purchasing a system with hardware RAID and redundant power supplies to minimize the number of after-hours pages. Spare systems or excess hardware capacity are a must for large shops, because the fact of the matter is regardless of the quality of hardware, systems will fail.

______________________

Comments

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Icinga, an open source fork of the Nagios project

Barry Allard's picture

It's is a virtual drop-in replacement for the venerable Nagios project which seems to have stalled into legacy land. It also has a performant, distributed architecture that includes database support for Postgres and Oracle as well as MySQL. It's pretty sweet and easy to deploy too. Something to watch and try out.

Correction: Rock Clusters

Barry Allard's picture

Rocks is primarily for bringing up high-performance compute clusters (HPC/HPCC) quickly not internet applications.

However another open source project, Eucalyptus, is like running a private, on-site Amazon Web Services (AWS). In addition, Eucalyptus can work with Rocks Clusters.

That website mentioned "savvysysadmin.com"

LnxAdm's picture

That website/url to savvy sysadmin ... http://savvysysadmin.com/
doesn't exist. And it's a new article. Maybe an overworked admin missed one of the 500 servers? :-) Just kidding...

nice article

Anonymous's picture

Really a nice article!.
if you use zabbix please consider to use Orabbix to monitor Oracle with zabbix
www.smartmarmot.com

Webinar
One Click, Universal Protection: Implementing Centralized Security Policies on Linux Systems

As Linux continues to play an ever increasing role in corporate data centers and institutions, ensuring the integrity and protection of these systems must be a priority. With 60% of the world's websites and an increasing share of organization's mission-critical workloads running on Linux, failing to stop malware and other advanced threats on Linux can increasingly impact an organization's reputation and bottom line.

Learn More

Sponsored by Bit9

Webinar
Linux Backup and Recovery Webinar

Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.

In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.

Learn More

Sponsored by Storix