Taming the Beast

The right plan can determine the difference between a large-scale system administration nightmare and a good night's sleep for you and your sysadmin team.

As the appetite for raw computing power continues to grow, so do the challenges associated with managing large numbers of systems, both physical and virtual. Private industry, government and scientific research organizations are leveraging larger and larger Linux environments for everything from high-energy physics data analysis to cloud computing. Clusters containing hundreds or even thousands of systems are becoming commonplace. System administrators are finding that the old way of doing things no longer works when confronted with massive Linux deployments. We are forced to rethink common tasks because the tools and strategies that served us well in the past are now crushed by an army of penguins. As someone who has worked in scientific computing for the past nine years, I know that large-scale system administration can at times be a nightmarish endeavor, but for those brave enough to tame the monster, it can be a hugely rewarding and satisfying experience.

People often ask me, “How is your department able to manage so many machines with such a small number of sysadmins?” The answer is that my basic philosophy of large-scale system administration is “keep things simple”. Complexity is the enemy. It almost always means more system management overhead and more failures. It's fairly straightforward for a single experienced Linux sysadmin to single-handedly manage a cluster of a thousand machines, as long as all of the systems are identical (or nearly identical). Start throwing in one-off servers with custom partitioning or additional NICs, and things start to become more difficult, and the number of sysadmins required to keep things running starts to increase.

An arsenal of weapons in the form of a complete box of system administration tools and techniques is vital if you plan to manage a large Linux environment effectively. In the past, you probably would be forced to roll your own large-scale system administration utilities. The good news is that compared to five or six years ago, many open-source applications now make managing even large clusters relatively straightforward.


System administrators know that monitoring is essential. I think Linux sysadmins especially have a natural tendency to be concerned with every possible aspect of their systems. We love to watch the number of running processes, memory consumption and network throughput on all our machines, but in the world of large-scale system administration, this mindset can be a liability. This is especially true when it comes to alerting. The problem with alerting on every potential hiccup is that you'll either go insane from the constant flood of e-mail and pages, or even worse, you'll start ignoring the alerts. Neither of those situations is desirable. The solution? Configure your monitoring system to alert only on actionable conditions—things that cause an interruption in service. For every monitoring check you enable, ask yourself “What action must be taken if this check triggers an alert?” If the answer is “nothing”, it's probably better not to enable the check.

I think it's smart to differentiate monitoring further into critical and noncritical alerts. E-mail and pager alerts should be reserved for things that require immediate action—for example, important systems that aren't pingable, full filesystems, degraded RAIDs and so on. Noncritical things, like NIS timeouts, instead should be displayed on a Web page that can be viewed when you get back from lunch. Also consider writing checks that automatically correct whatever condition they are monitoring. Instead of your script sending you an e-mail when Apache dies, why not have it try restarting httpd automatically? If you go the auto-correcting “self-healing” route, I'd recommend logging whatever action your script takes so you can troubleshoot the failure later.

When selecting a monitoring tool in a large environment, you have to think about scalability. I have seen both Zabbix and Nagios used to monitor in excess of 1,500 machines and implement tens of thousands of checks. Even with these tools, you might want to scale horizontally by dividing your machines into logical groups and then running a single monitoring server per group. This will increase complexity, but if done correctly, it will also prevent your monitoring infrastructure from going up in flames.



Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Icinga, an open source fork of the Nagios project

Barry Allard's picture

It's is a virtual drop-in replacement for the venerable Nagios project which seems to have stalled into legacy land. It also has a performant, distributed architecture that includes database support for Postgres and Oracle as well as MySQL. It's pretty sweet and easy to deploy too. Something to watch and try out.

Correction: Rock Clusters

Barry Allard's picture

Rocks is primarily for bringing up high-performance compute clusters (HPC/HPCC) quickly not internet applications.

However another open source project, Eucalyptus, is like running a private, on-site Amazon Web Services (AWS). In addition, Eucalyptus can work with Rocks Clusters.

That website mentioned "savvysysadmin.com"

LnxAdm's picture

That website/url to savvy sysadmin ... http://savvysysadmin.com/
doesn't exist. And it's a new article. Maybe an overworked admin missed one of the 500 servers? :-) Just kidding...

nice article

Anonymous's picture

Really a nice article!.
if you use zabbix please consider to use Orabbix to monitor Oracle with zabbix