Puppet and Nagios: a Roadmap to Advanced Configuration
Puppet has provided baked-in Nagios support for a long time now. When combined with Exported Resources, Puppet is well suited to manage an intelligent Nagios configuration where nodes are automatically inventoried and monitored. The excellent Pro Puppet, written by James Turnbull, provides a fairly complete rundown of the installation and configuration steps needed in order to progress in this direction, so I won't repeat the information here. Instead, this article highlights some less-than-optimal default behavior of the Nagios types and details my solution that results in a cleaner filesystem and improved performance.
Not All Resources Should Be Exported!
This took me an embarrassingly long time to figure out. Just like resources that are defined in a manifest, Exported Resources must be unique. For example, suppose we have nodes foo and bar, which we'd like to categorize into a Nagios hostgroup named "PMF". At first glance, adding the following code to foo's manifest might seem like the way to go:
@@nagios_hostgroup { "PMF":
  ensure => present,
  hostgroup_members +> [ $::hostname ]
}
In theory, the resource will be exported to the database when the first node compiles its manifest, but the next node's compilation will complain with a duplicate resource error. For this reason, we will avoid exporting resources created by this particular type. Instead, we will manage our hostgroup memberships via the hostgroup parameter of the nagios_host type.
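To make the uniqueness rule concrete, here is a toy model in Ruby. It is not Puppet's actual implementation: the ToyCatalog class and its add method are hypothetical names invented for this sketch. The only idea carried over from Puppet is that a catalog may hold one resource per type and title.

```ruby
# Toy model of the uniqueness rule. ToyCatalog is a hypothetical
# name and does not mirror Puppet's internals.
class DuplicateResourceError < StandardError; end

class ToyCatalog
  def initialize
    @resources = {}
  end

  # Register a resource, keyed by its type and title.
  def add(type, title, exported_by)
    key = [type, title]
    if @resources.key?(key)
      raise DuplicateResourceError,
            "#{type}[#{title}] already exported by #{@resources[key]}"
    end
    @resources[key] = exported_by
  end

  def size
    @resources.size
  end
end

catalog = ToyCatalog.new
catalog.add('Nagios_host', 'foo', 'foo')       # unique title: accepted
catalog.add('Nagios_host', 'bar', 'bar')       # unique title: accepted
catalog.add('Nagios_hostgroup', 'PMF', 'foo')  # first export: accepted

begin
  catalog.add('Nagios_hostgroup', 'PMF', 'bar') # same type and title
rescue DuplicateResourceError => e
  puts "rejected: #{e.message}"
end
```

Titles derived from the hostname never collide, which is why nagios_host exports cleanly while a shared hostgroup title does not.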
Had it not been for Pieter Barrezeele's blog (http://pieter.barrezeele.be/2009/05/11/puppet-and-nagios), I may have ended up settling for Puppet's fairly inefficient approach to storing resources managed via its Nagios types. By default, these bits are maintained in hard-coded file paths according to the type used. For example, all resources based on the nagios_service type are collected and stored in /etc/nagios/nagios_service.cfg, and so on. For performance reasons, I want each collected resource to be stored in its own file path based on the following naming convention:
<base_directory>/<type>_<title>_<hostname>.cfg
Furthermore, I want my filenames to be composed of all lowercase letters and spaces replaced with underscores. For starters, let's add the bare minimum snippets of code into our manifests in order to export and collect resources using the nagios_host type (Listings 1 and 2).
Listing 1. modules/nagios/manifests/init.pp
# This class will be used by the nagios server
class nagios {
  service { 'nagios':
    ensure => running,
    enable => true,
  }
  # Be sure to include this directory in your nagios.cfg
  # with the cfg_dir directive
  file { 'resource-d':
    path => '/etc/nagios/resource.d',
    ensure => directory,
    owner => 'nagios',
  }
  # Collect the nagios_host resources
  Nagios_host <<||>> {
    require => File['resource-d'],
    notify => Service['nagios'],
  }
}
Listing 2. modules/nagios/manifests/export.pp
# All agents (including the nagios server) will use this
class nagios::export {
  @@nagios_host { $::hostname:
    address => $::ipaddress,
    check_command => 'check_host_alive!3000.0,80%!5000.0,100%!10',
    target => "/etc/nagios/resource.d/host_${::hostname}.cfg",
  }
}
Note:
Due to the inherent space limitations of published articles, all code will be kept as minimal as possible while conforming to the structure of Puppet Modules. However, no attempt will be made to reproduce a complete module capable of managing a Nagios instance. Instead, I focus on the concepts that have been defined in this article's introduction. Please see http://docs.puppetlabs.com if you need an introduction to Puppet modules.
Let's examine the good and the not-so-good aspects of what we've defined up to this point. On the positive side, all agents will export a nagios_host resource. The Nagios server, upon compiling its manifest, will collect each resource, store it in a unique file, and refresh the Nagios service. At first glance, it may seem like our work is done. Unfortunately, our solution is littered with the following issues and shortcomings:
- Nagios will not be able to read the newly created .cfg files since the Puppet Agent will create them while running as the root user.
- There is too much "coordination" needed with the target parameter of the nagios_host type. We should not have to work so hard in order to ensure our target points to the correct file and is void of unpleasant things like spaces and/or mixed case.
- The address parameter is hard-coded with the value of the ipaddress fact. Although this may be acceptable in some environments, we really should allow for greater flexibility.
- No ability exists to leverage Nagios hostgroups.
- Puppet will be unable to purge our exported resources, because we are not using the default behavior of the target parameter.
In order to solve these issues, let's write a new definition to act as a wrapper for all of the Nagios types we plan to use. Before we begin, let's make sure we understand the most important problem—the issue of file ownership and permissions for the newly generated .cfg files. Because these files are created via the target parameter of each associated Nagios type, they'll be written to disk by the user Puppet runs as. This means they will be owned by the root user/group, and Nagios will not have permission to read them (because I know you are not running Nagios as root, correct?). Although some people have chosen to work around this problem by chowning the files via Puppet's exec type, we're going to do something far cleaner in order to maintain Puppet's greatest attribute, abstraction.
After plenty of failed "bright ideas" and confusion on my end, it became clear that it would be quite simple to control the ownership and permissions of each newly created .cfg file if each was managed as a file resource. We can fold the creation of these file resources into our wrapper definition and export them just as we do with the Nagios types. Each file resource then can be defined easily with appropriate properties as well as requiring their corresponding Nagios type. When our Nagios server collects these resources, it first will create the file from the collected Nagios type before managing the file's attributes. Let's examine the new and refactored code.
The nagios::params class:
First, let's define a few variables in a central location. Doing so will aid us in our quest to be "lazy" and not have to match values in various areas of our manifests (Listing 3).
Listing 3. modules/nagios/manifests/params.pp
class nagios::params {
  $resource_dir = '/etc/nagios/resource.d'
  $user = 'nagios'
  case $::operatingsystem {
    debian: {
      $service = 'nagios3'
    }
    solaris: {
      $service = 'cswnagios'
    }
    default: {
      fail("This module is not supported on ${::operatingsystem}")
    }
  }
}
The nagios::resource definition and friends:
Our custom resource definition will serve as a wrapper for all Nagios types. Due to space considerations, the included code covers only the nagios_host and nagios_hostgroup types. Of course, this definition can and should be extended to support every Nagios type we intend to use. Each supported type is represented in its own appropriately named definition one level under the nagios::resource namespace. Also included is a nagios::resource::file definition that is responsible for creating the previously mentioned .cfg file (Listings 4–7).
Listing 4. modules/nagios/manifests/resource.pp
define nagios::resource(
  $export,
  $type,
  $host_use = 'generic-host',
  $ensure = 'present',
  $owner = 'nagios',
  $address = '',
  $hostgroups = '',
  $check_command = ''
) {
  include nagios::params
  # figure out where to write the file
  # replace spaces with an underscore and convert
  # everything to lowercase
  $target = inline_template("${nagios::params::resource_dir}/${type}_<%= name.gsub(/\\s+/, '_').downcase %>.cfg")
  case $export {
    true, false: {}
    default: { fail("The export parameter must be set to true or false.") }
  }
  case $type {
    host: {
      nagios::resource::host { $name:
        ensure => $ensure,
        use => $host_use,
        check_command => $check_command,
        address => $address,
        hostgroups => $hostgroups,
        target => $target,
        export => $export,
      }
    }
    hostgroup: {
      nagios::resource::hostgroup { $name:
        ensure => $ensure,
        target => $target,
        export => $export,
      }
    }
    default: {
      fail("Unknown type passed to this define: ${type}")
    }
  }
  # create or export the file resource needed to support
  # the nagios type above
  nagios::resource::file { $target:
    ensure => $ensure,
    export => $export,
    resource_tag => "nagios_${type}",
    requires => "Nagios_${type}[${name}]",
  }
}
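The $target computation in Listing 4 is easy to try outside Puppet. The following plain-Ruby sketch applies the same gsub and downcase calls as the inline template; nagios_target is a hypothetical helper name used only for illustration.

```ruby
# Plain-Ruby sketch of the $target computation from Listing 4.
# nagios_target is a hypothetical helper name.
def nagios_target(resource_dir, type, title)
  # Collapse each run of whitespace into one underscore, then lowercase.
  file_name = "#{type}_#{title.gsub(/\s+/, '_').downcase}.cfg"
  File.join(resource_dir, file_name)
end

puts nagios_target('/etc/nagios/resource.d', 'hostgroup', 'Puppet Servers')
puts nagios_target('/etc/nagios/resource.d', 'host', 'Foo')
```

With these inputs, the hostgroup titled "Puppet Servers" lands in hostgroup_puppet_servers.cfg and the host titled "Foo" in host_foo.cfg, both under the resource directory.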
Listing 5. modules/nagios/manifests/resource/file.pp
define nagios::resource::file(
  $resource_tag,
  $requires,
  $export = true,
  $ensure = 'present',
) {
  include nagios::params
  if $export {
    @@file { $name:
      ensure => $ensure,
      tag => $resource_tag,
      owner => $nagios::params::user,
      require => $requires,
    }
  } else {
    file { $name:
      ensure => $ensure,
      tag => $resource_tag,
      owner => $nagios::params::user,
      require => $requires,
    }
  }
}
Listing 6. modules/nagios/manifests/resource/host.pp
define nagios::resource::host(
  $address,
  $hostgroups,
  $export,
  $target,
  $check_command,
  $use,
  $ensure = 'present'
) {
  include nagios::params
  if $export {
    @@nagios_host { $name:
      ensure => $ensure,
      address => $address,
      check_command => $check_command,
      use => $use,
      target => $target,
      hostgroups => $hostgroups ? {
        '' => undef,
        default => $hostgroups,
      },
    }
  } else {
    nagios_host { $name:
      ensure => $ensure,
      address => $address,
      check_command => $check_command,
      use => $use,
      target => $target,
      require => File[$nagios::params::resource_dir],
      hostgroups => $hostgroups ? {
        '' => undef,
        default => $hostgroups,
      },
    }
  }
}
Listing 7. modules/nagios/manifests/resource/hostgroup.pp
define nagios::resource::hostgroup(
  $target,
  $ensure = 'present',
  $hostgroup_alias = '',
  $export = false
) {
  include nagios::params
  if $export {
    fail("It is not appropriate to export the Nagios_hostgroup type since it will result in duplicate resources.")
  } else {
    nagios_hostgroup { $name:
      ensure => $ensure,
      target => $target,
      require => File[$nagios::params::resource_dir],
    }
  }
}
Listing 8 shows our refactored nagios::export class that is meant to be used by all nodes. Notice how we no longer leverage the nagios_host type directly. Instead, we call upon our newly created nagios::resource definition. Both the address and hostgroups parameters will use sane defaults unless they are overridden with node scoped variables. Also, notice how the target parameter is no longer required, as our nagios::resource definition performs the heavy lifting for us.
Listing 8. modules/nagios/manifests/export.pp
# All agents (including the nagios server) will use this
class nagios::export {
  nagios::resource { $::hostname:
    type => 'host',
    address => inline_template("<%= has_variable?('my_nagios_interface') ? eval('ipaddress_' + my_nagios_interface) : ipaddress %>"),
    # No leading $ inside the ERB tag: Puppet would interpolate it
    # before the template is evaluated
    hostgroups => inline_template("<%= has_variable?('my_nagios_hostgroups') ? my_nagios_hostgroups : 'Other' %>"),
    check_command => 'check_host_alive!3000.0,80%!5000.0,100%!10',
    export => true,
  }
}
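The fallback logic inside the two inline templates of Listing 8 can be read more easily as ordinary Ruby. In this minimal sketch, a hash stands in for Puppet's variable scope, and the helper names nagios_address and nagios_hostgroups are hypothetical; the addresses come from documentation-reserved ranges.

```ruby
# Sketch of the fallback logic in Listing 8's inline templates.
# The vars hash stands in for Puppet's scope; both helper names
# are hypothetical.
def nagios_address(vars)
  iface = vars['my_nagios_interface']
  # Use the address of the chosen interface, else the default fact.
  iface ? vars["ipaddress_#{iface}"] : vars['ipaddress']
end

def nagios_hostgroups(vars)
  # Fall back to the catch-all "Other" hostgroup.
  vars.fetch('my_nagios_hostgroups', 'Other')
end

plain_node = { 'ipaddress' => '192.0.2.10' }
tuned_node = {
  'ipaddress'            => '192.0.2.20',
  'ipaddress_eth1'       => '198.51.100.20',
  'my_nagios_interface'  => 'eth1',
  'my_nagios_hostgroups' => 'Puppet Servers'
}

[plain_node, tuned_node].each do |vars|
  puts "#{nagios_address(vars)} -> #{nagios_hostgroups(vars)}"
end
```

A node that sets neither variable is monitored on its default address and filed under "Other", which is why the nagios class on the server must define that hostgroup locally.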
As you can see, the nagios::export class is ready to be extended with any kind of resource supported by our nagios::resource definition. Whenever we want all clients to export a particular resource, we just add it here so long as the following requirements are met:
- The resource name must be unique.
- The type parameter must be set.
- The export parameter must be set to a value of true.
Now that all of our agents are exporting a nagios_host resource, we can focus on the collection side of things.
Tip: Short-and-Sweet Nagios Service Descriptions
When you get around to extending nagios::resource with support for the nagios_service type, you may want to consider using an inline ERB template to handle the service_description parameter. The following code removes the last word (which should be the hostname) from the description displayed in Nagios:
service_description => inline_template("<%= name.gsub(/\\w+$/, '').chomp(' ') %>"),
Now, a resource defined with a unique title, such as "Puppet Agent $::hostname", is displayed as "Puppet Agent" in Nagios.
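The expression from the tip can be checked in plain Ruby. The sketch below wraps it in a helper and, as an assumption that goes beyond the article, adds a variant for hostnames containing characters that \w does not match, such as hyphens. Both helper names are hypothetical.

```ruby
# The expression from the tip, wrapped in a hypothetical helper.
def strip_last_word(title)
  title.gsub(/\w+$/, '').chomp(' ')
end

# Alternative (an assumption, not from the article): drop the final
# whitespace-separated token, whatever characters it contains.
def strip_last_token(title)
  title.sub(/\s+\S+\z/, '')
end

puts strip_last_word('Puppet Agent foo')
puts strip_last_word('Puppet Agent web-01')
puts strip_last_token('Puppet Agent web-01')
```

For a plain hostname such as foo, the first helper yields "Puppet Agent". With web-01 it removes only the trailing 01, so the token-based variant may be the safer choice if your hostnames contain hyphens.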
Expire, Collect and Purge Exported Resources
Up until this point, the job of our Nagios server simply has been to collect exported resources. In the real world, the nodes it monitors are retired for one reason or another quite routinely. When a node is retired, I want to be sure the relevant Nagios objects are removed and the corresponding database records are deleted. According to Puppet's documentation, these resources can be purged from the collector only when default target locations are leveraged (http://docs.puppetlabs.com/references/stable/type.html#nagioshost). Even so, I wasn't happy to see orphaned database records left behind and decided to address this issue with a few Puppet functions and some basic class ordering. Before we dive in, some work flow and terminology must be understood:
- Expire: a Nagios resource is "expired" by setting the value of its "ensure" parameter to "absent".
- Collect: the resource is removed from the collector due to the value of its "ensure" parameter.
- Purge: all database records associated with the expired host are deleted.
Ordering is obviously a big deal here. In order to ensure proper execution of each task, we will break out each unit of work into its own class and use a mix of "include" and "require" functions. Using Puppet terminology, we now can express this "expire, collect, then purge" work flow as follows:
- The nagios class requires the nagios::expire_resources class.
- The nagios class includes the nagios::purge_resources class.
- The nagios::purge_resources class requires the nagios::collect_resources class.
Now, let's look at two custom functions, expire_exported and purge_exported. These functions (written for PostgreSQL) perform the database operations that are required in order to expire and purge hosts and their resources. They both operate on a node-scoped variable named $my_nagios_purge_hosts, which should contain an array of hostnames. If used, this variable should be placed somewhere in your Nagios server's node definition. For example:
node corona {
  $my_nagios_purge_hosts = [ 'foo', 'bar', 'baz' ]
  include nagios
}
With this node-scoped variable defined, your (affectionately named) Nagios server will reconfigure itself after dropping all resources for the three hosts mentioned above (Listings 9 and 10).
Listing 9. nagios/lib/puppet/parser/functions/expire_exported.rb
Puppet::Parser::Functions::newfunction(
  :expire_exported,
  :doc => "Sets a host's resources to ensure => absent as part of a purge work-flow.") do |args|
  require 'rubygems'
  require 'pg'
  require 'puppet'
  raise Puppet::ParseError, "Missing hostname." if args.empty?
  hosts = args.flatten
  begin
    conn = PGconn.open(:dbname => 'puppet', :user => 'postgres')
    hosts.each do |host|
      Puppet.notice("Expiring resources for host: #{host}")
      conn.exec("SELECT id FROM hosts WHERE name = '#{host}'") do |host_id|
        raise "Too many hosts" if host_id.ntuples > 1
        conn.exec("SELECT id FROM param_names WHERE name = 'ensure'") do |param_id|
          conn.exec("SELECT id FROM resources WHERE host_id = #{host_id.values.flatten[0].to_i}") do |results|
            resource_ids = []
            results.each do |row|
              resource_ids << Hash[*row.to_a.flatten]
            end
            resource_ids.each do |resource|
              # param_id.values is an array of rows, so pull out the single id
              conn.exec("UPDATE param_values SET value = 'absent' WHERE resource_id = #{resource['id']} AND param_name_id = #{param_id.values.flatten[0].to_i}")
            end
          end
        end
      end
    end
  rescue => e
    Puppet.notice(e.message)
  ensure
    # conn is nil if the connection attempt itself failed
    conn.close if conn
  end
end
Listing 10. nagios/lib/puppet/parser/functions/purge_exported.rb
# This function will be used by the exported
# resources collector (the nagios box)
Puppet::Parser::Functions::newfunction(:purge_exported, :doc => "delete expired resources.") do |args|
  require 'rubygems'
  require 'pg'
  require 'puppet'
  raise Puppet::ParseError, "Missing hostname." if args.empty?
  hosts = args.flatten
  begin
    conn = PGconn.open(:dbname => 'puppet', :user => 'postgres')
    hosts.each do |host|
      Puppet.notice("Purging expired resources for host: #{host}")
      conn.exec("SELECT id FROM hosts WHERE name = '#{host}'") do |host_id|
        raise "Too many hosts" if host_id.ntuples > 1
        conn.exec("SELECT id FROM resources WHERE host_id = #{host_id.values.flatten[0].to_i}") do |results|
          resource_ids = []
          results.each do |row|
            resource_ids << Hash[*row.to_a.flatten]
          end
          resource_ids.each do |resource|
            conn.exec("DELETE FROM param_values WHERE resource_id = #{resource['id']}")
            conn.exec("DELETE FROM resources WHERE id = #{resource['id']}")
          end
        end
        # host_id.values is an array of rows, so pull out the single id
        conn.exec("DELETE FROM hosts WHERE id = #{host_id.values.flatten[0].to_i}")
      end
    end
  rescue => e
    Puppet.notice(e.message)
  ensure
    # conn is nil if the connection attempt itself failed
    conn.close if conn
  end
end
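Both functions interpolate each hostname directly into their SQL strings, so a stray quote in $my_nagios_purge_hosts would break the query. One possible safeguard, sketched here as an assumption rather than as part of the original functions, is to validate the list before it reaches the database. The validate_hostnames helper and HOSTNAME_PATTERN constant are hypothetical names.

```ruby
# Hypothetical pre-check for the hostnames handed to expire_exported
# and purge_exported; not part of the article's code.
HOSTNAME_PATTERN = /\A[A-Za-z0-9]([A-Za-z0-9.-]*[A-Za-z0-9])?\z/

# Returns the flattened list, or raises on the first entry that is
# not a plain hostname.
def validate_hostnames(hosts)
  hosts.flatten.each do |host|
    unless host.is_a?(String) && host =~ HOSTNAME_PATTERN
      raise ArgumentError, "Refusing suspicious hostname: #{host.inspect}"
    end
  end
end

puts validate_hostnames(['foo', ['bar', 'baz']]).inspect

begin
  validate_hostnames(["foo'; DELETE FROM hosts; --"])
rescue ArgumentError => e
  puts e.message
end
```

A call such as hosts = validate_hostnames(args) near the top of each function would be enough to apply the check.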
And, now for the refactored nagios class and related code (Listings 11–14).
Listing 11. modules/nagios/manifests/init.pp
# This class will be used by the nagios server
class nagios {
  include nagios::params
  require nagios::expire_resources
  include nagios::purge_resources
  service { $nagios::params::service:
    ensure => running,
    enable => true,
  }
  # nagios.cfg needs this specified via the cfg_dir directive
  file { $nagios::params::resource_dir:
    ensure => directory,
    owner => $nagios::params::user,
  }
  # Local Nagios resources
  nagios::resource { [ 'Nagios Servers', 'Puppet Servers', 'Other' ]:
    type => hostgroup,
    export => false,
  }
}
Listing 12. modules/nagios/manifests/expire_resources.pp
class nagios::expire_resources {
  if $my_nagios_purge_hosts {
    expire_exported($my_nagios_purge_hosts)
  }
}
Listing 13. modules/nagios/manifests/purge_resources.pp
class nagios::purge_resources {
  require nagios::collect_resources
  if $my_nagios_purge_hosts {
    purge_exported($my_nagios_purge_hosts)
  }
}
Listing 14. modules/nagios/manifests/collect_resources.pp
class nagios::collect_resources {
  include nagios::params
  Nagios_host <<||>> {
    # require takes a resource reference, not a bare path
    require => File[$nagios::params::resource_dir],
    notify => Service[$nagios::params::service],
  }
  File <<| tag == 'nagios_host' |>> {
    notify => Service[$nagios::params::service],
  }
}
The basic building blocks are now in place. Extend nagios::resource, plug the classes into your nagios module and kick back. If a node goes MIA and needs to be purged, toss it into your $my_nagios_purge_hosts array and be done with it. Until next time, may your Nagios dashboards be green and your alerts be few.