Taming the Beast
As the appetite for raw computing power continues to grow, so do the challenges associated with managing large numbers of systems, both physical and virtual. Private industry, government and scientific research organizations are leveraging larger and larger Linux environments for everything from high-energy physics data analysis to cloud computing. Clusters containing hundreds or even thousands of systems are becoming commonplace. System administrators are finding that the old way of doing things no longer works when confronted with massive Linux deployments. We are forced to rethink common tasks because the tools and strategies that served us well in the past are now crushed by an army of penguins. As someone who has worked in scientific computing for the past nine years, I know that large-scale system administration can at times be a nightmarish endeavor, but for those brave enough to tame the monster, it can be a hugely rewarding and satisfying experience.
People often ask me, “How is your department able to manage so many machines with such a small number of sysadmins?” The answer is that my basic philosophy of large-scale system administration is “keep things simple”. Complexity is the enemy. It almost always means more system management overhead and more failures. It's fairly straightforward for a single experienced Linux sysadmin to single-handedly manage a cluster of a thousand machines, as long as all of the systems are identical (or nearly identical). Start throwing in one-off servers with custom partitioning or additional NICs, and things start to become more difficult, and the number of sysadmins required to keep things running starts to increase.
An arsenal of weapons in the form of a complete box of system administration tools and techniques is vital if you plan to manage a large Linux environment effectively. In the past, you probably would be forced to roll your own large-scale system administration utilities. The good news is that compared to five or six years ago, many open-source applications now make managing even large clusters relatively straightforward.
System administrators know that monitoring is essential. I think Linux sysadmins especially have a natural tendency to be concerned with every possible aspect of their systems. We love to watch the number of running processes, memory consumption and network throughput on all our machines, but in the world of large-scale system administration, this mindset can be a liability. This is especially true when it comes to alerting. The problem with alerting on every potential hiccup is that you'll either go insane from the constant flood of e-mail and pages, or even worse, you'll start ignoring the alerts. Neither of those situations is desirable. The solution? Configure your monitoring system to alert only on actionable conditions—things that cause an interruption in service. For every monitoring check you enable, ask yourself “What action must be taken if this check triggers an alert?” If the answer is “nothing”, it's probably better not to enable the check.
If you were asked to name the first monitoring application that comes to mind, it probably would be Nagios. Used by just about everyone, Nagios is currently the king of open-source monitoring tools.
Zabbix sports a slick Web interface that is sure to make any manager happy. Zabbix scales well and might be posed to give Nagios a run for its money.
Ganglia is one of those must-have tools for Linux environments of any size. Its strengths include trending and performance monitoring.
I think it's smart to differentiate monitoring further into critical and noncritical alerts. E-mail and pager alerts should be reserved for things that require immediate action—for example, important systems that aren't pingable, full filesystems, degraded RAIDs and so on. Noncritical things, like NIS timeouts, instead should be displayed on a Web page that can be viewed when you get back from lunch. Also consider writing checks that automatically correct whatever condition they are monitoring. Instead of your script sending you an e-mail when Apache dies, why not have it try restarting httpd automatically? If you go the auto-correcting “self-healing” route, I'd recommend logging whatever action your script takes so you can troubleshoot the failure later.
When selecting a monitoring tool in a large environment, you have to think about scalability. I have seen both Zabbix and Nagios used to monitor in excess of 1,500 machines and implement tens of thousands of checks. Even with these tools, you might want to scale horizontally by dividing your machines into logical groups and then running a single monitoring server per group. This will increase complexity, but if done correctly, it will also prevent your monitoring infrastructure from going up in flames.