Chapter 4: Nagios Basics

Chapter 4 - from the book Nagios: System and Network Monitoring by Wolfgang Barth -- Reprinted by permission from No Starch Press and Open Source Press.  Available at booksellers now.  Full book details are at the bottom of the article.

The fact that a host can be reached, in itself, has little meaning if no service is running on it on which somebody or something relies.  Accordingly, everything in Nagios revolves around service checks.  After all, no service can run without a host. If the host computer fails, it also cannot provide the desired service.

Things get slightly more complicated if a router, for example, is brought into play, which lies between users and the system providing services.  If this fails, the desired service may still be running on the target host, but it is nevertheless no longer reachable for the user.

Nagios is in a position to reproduce such dependencies and to precisely inform the administrator of the failure of an important network component, instead of flooding the administrator with irrelevant error messages concerning services that cannot be reached.  An understanding of such dependencies is essential for the smooth operation of Nagios, which is why Section 4.1 will look in more detail at these dependencies and the way Nagios works.

Another important item is the state of a host or service.  On the one hand, Nagios allows a much finer distinction than just "ok" or "not ok"; on the other hand, the distinction between soft state and hard state means that the administrator does not have to deal with short-term disruptions that have long since disappeared by the time the information arrives.  These states also influence the intensity of the service checks.  How this functions in detail is described in Section 4.3.
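As a rough illustration of the soft/hard distinction, the following Python sketch classifies a series of check results. It is a simplification and not Nagios's actual implementation: the function name is hypothetical, and it assumes that a problem becomes hard after a fixed number of consecutive failed checks, in the manner of the max_check_attempts directive, which this chapter has not introduced.

```python
def classify_state_types(results, max_check_attempts=3):
    """Label each check result as a SOFT or HARD state.

    results: iterable of booleans, True meaning the check succeeded.
    max_check_attempts: assumed number of failed checks before a
    problem counts as hard (modeled on the Nagios directive of the
    same name; the default of 3 is an arbitrary choice).
    """
    history = []
    attempt = 0           # consecutive failed checks so far
    hard_problem = False  # True once the problem has become hard
    for ok in results:
        if ok:
            # Recovering from a problem that never became hard is
            # treated as a soft recovery; anything else is hard.
            soft_recovery = attempt > 0 and not hard_problem
            history.append(("OK", "SOFT" if soft_recovery else "HARD"))
            attempt = 0
            hard_problem = False
        else:
            if not hard_problem:
                attempt += 1
                if attempt >= max_check_attempts:
                    hard_problem = True
            history.append(("PROBLEM", "HARD" if hard_problem else "SOFT"))
    return history


# A short disruption never reaches the hard state:
print(classify_state_types([True, False, True]))
# A lasting one does, on the third failed check:
print(classify_state_types([True, False, False, False]))
```

In this model only the hard entries would be worth reporting to the administrator, which is the point made above about short-term disruptions.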

4.1 Taking into Account the Network Topology

How Nagios handles dependencies of hosts and services can be best illustrated with an example.  Figure 4.1 represents a small network in which the Domain Name Service on proxy is to be monitored.

Figure 4.1: Topology of an example network

The service check always serves as the starting point for monitoring that is regularly performed by the system.  As long as the service can be reached, Nagios takes no further steps; that is, it does not perform any host checks.  For switch1, switch2, and proxy, such a check would be pointless anyway, because if the DNS service on proxy responds, then the hosts mentioned are necessarily accessible.

If the name service fails, however, Nagios tests the computer involved with a host check, to see whether the service or the host is causing the problem.  If proxy cannot be reached, Nagios might test the parent hosts entered in the configuration (Figure 4.2).  With the parents host parameter, the administrator has a means available to provide Nagios with information on the network topology.

Figure 4.2: The order of tests performed after a service failure.

When doing this, the administrator enters only the direct neighbor computer for each host on the path to the Nagios server as the parent.  Hosts that are located in the same network segment as the Nagios server itself are defined without a parent.  For the network topology from Figure 4.1, the corresponding configuration (reduced to the host name and parent) appears as follows:

define host{
    host_name  proxy
    parents    switch2
}

define host{
    host_name  switch2
    parents    switch1
}

# switch1 has no parents entry: it shares a segment with the Nagios server
define host{
    host_name  switch1
}

switch1 is located in the same network segment as the Nagios server, so it is not allocated a parent computer.  What belongs to a network segment is a matter of opinion: if you interpret the switches as the segment limit, as is the case here, you gain the advantage of being able to isolate a disruption more closely.  But you can also take a different view and interpret an IP subnetwork as a segment.  Then a router would form the segment limit; in our example, proxy would then count as being in the same network as the Nagios server.  However, it would no longer be possible to distinguish between a failure of proxy and a failure of switch1 or switch2.
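To make the parent relationships concrete, the following Python sketch reads host definitions like the ones above into a dictionary and follows the chain from a host back toward the Nagios server. It is only an illustration under stated assumptions: the function names are hypothetical, each host is assumed to list at most one parent, and the parser handles nothing beyond the reduced host_name/parents form shown here.

```python
import re

# The example topology, in the reduced form used in this section.
CONFIG = """
define host{
    host_name  proxy
    parents    switch2
}
define host{
    host_name  switch2
    parents    switch1
}
define host{
    host_name  switch1
}
"""


def parse_parents(config_text):
    """Return {host_name: parent or None} from 'define host' blocks."""
    parents = {}
    blocks = re.findall(r"define\s+host\s*\{(.*?)\}", config_text, flags=re.DOTALL)
    for block in blocks:
        fields = {}
        for line in block.splitlines():
            parts = line.split(None, 1)  # directive name, then its value
            if len(parts) == 2:
                fields[parts[0]] = parts[1].strip()
        parents[fields["host_name"]] = fields.get("parents")
    return parents


def path_to_server(host, parents):
    """List the hosts from 'host' up to the one without a parent."""
    path = [host]
    while parents.get(path[-1]) is not None:
        path.append(parents[path[-1]])
    return path


parents = parse_parents(CONFIG)
print(path_to_server("proxy", parents))  # ['proxy', 'switch2', 'switch1']
```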

Figure 4.3: Classification of individual network nodes by Nagios.

If switch2 in the example fails, Figure 4.3 shows the sequence in which Nagios proceeds: first the system, when checking the DNS service on proxy, determines that this service is no longer reachable (1).  To differentiate, it now performs a host check to see what the state of the proxy computer is (2).  Since proxy cannot be reached, but it has switch2 as a parent, Nagios also subjects switch2 to a host check (3).  If this switch also cannot be reached, the system checks its parent, switch1 (4).
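The sequence of host checks in steps (2) to (4) can be sketched in Python as a walk up the parent chain that stops at the first host that answers. This is a simplified model, not the Nagios scheduler: the function name and the responds callback are hypothetical, and a single parent per host is assumed.

```python
def checks_after_service_failure(host, parents, responds):
    """Return the hosts checked, in order, after a service on 'host' fails.

    parents:  {host_name: parent or None}
    responds: callable taking a host name and returning True if that
              host answers its host check (stand-in for a real check)
    """
    checked = []
    current = host
    while current is not None:
        checked.append(current)
        if responds(current):
            break  # found a reachable host, no need to go further up
        current = parents.get(current)
    return checked


parents = {"proxy": "switch2", "switch2": "switch1", "switch1": None}

# Assume switch2 is the failed component: from the Nagios server,
# only switch1 still answers.
print(checks_after_service_failure("proxy", parents, lambda h: h == "switch1"))
# ['proxy', 'switch2', 'switch1']
```

If proxy itself answers the host check, the walk ends after one step, which corresponds to the case where only the service has failed.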

If Nagios can establish contact with switch1, the cause for the failure of the DNS service on proxy can be isolated to switch2.  The system accordingly specifies the states of the host: switch1 is UP, switch2 DOWN; proxy, on the other hand, is UNREACHABLE.  Through a suitable configuration of the Nagios messaging system (see Section 12.3 on page 217) you can use this distinction to determine, for example, that the administrator is informed only about the host that is in the DOWN state and represents the actual problem, but not about the hosts that are dependent on the down host.
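The distinction between DOWN and UNREACHABLE can be expressed as a small rule: a host that does not answer is DOWN if its parent still answers (or if it has no parent), and UNREACHABLE otherwise. The Python sketch below applies that rule to the example network. It is an illustrative model with hypothetical names, it assumes one parent per host, and it takes the set of failed hosts as given rather than running real checks.

```python
def classify_hosts(parents, failed):
    """Map every host to 'UP', 'DOWN', or 'UNREACHABLE'.

    parents: {host_name: parent or None}
    failed:  set of hosts that have actually failed
    """
    def responds(host):
        # A host answers only if it and every host on the path
        # to the Nagios server are working.
        while host is not None:
            if host in failed:
                return False
            host = parents[host]
        return True

    states = {}
    for host, parent in parents.items():
        if responds(host):
            states[host] = "UP"
        elif parent is None or responds(parent):
            states[host] = "DOWN"         # the actual problem
        else:
            states[host] = "UNREACHABLE"  # hidden behind a failed parent
    return states


parents = {"switch1": None, "switch2": "switch1", "proxy": "switch2"}
print(classify_hosts(parents, failed={"switch2"}))
# {'switch1': 'UP', 'switch2': 'DOWN', 'proxy': 'UNREACHABLE'}
```

Note that in this model a host behind a failed parent is reported as UNREACHABLE even if it has failed as well, since from the monitoring server the two cases cannot be told apart.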

In a further step, Nagios can determine other topology-specific failures in the network (so-called network outages).  proxy is the parent of gate, so gate is also represented as UNREACHABLE (5).  gate in turn also functions as a parent; the Internet server dependent on this is also classified as UNREACHABLE.

This "intelligence", which distinguishes Nagios, becomes more valuable to the administrator as the number of hosts and services that depend on a failed component grows.  For a router in the backbone, on which hundreds of hosts and services are dependent, the system informs administrators of the specific disruption, instead of sending them hundreds of error messages that are not wrong in principle, but are not really of any help in trying to eliminate the disruption.
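To get a feel for the scale of such an outage, the following Python sketch counts the hosts that sit behind a failed component by walking the parent relationships in the opposite direction. The names are hypothetical ("internet-server" and "backbone-router" are placeholders, and the 300 hosts are invented for the illustration), and one parent per host is assumed.

```python
from collections import defaultdict, deque


def affected_hosts(parents, failed_host):
    """Return all hosts that depend, directly or not, on 'failed_host'."""
    # Invert the parent map so each host knows its children.
    children = defaultdict(list)
    for host, parent in parents.items():
        if parent is not None:
            children[parent].append(host)

    # Breadth-first walk over everything below the failed host.
    affected = []
    queue = deque(children[failed_host])
    while queue:
        host = queue.popleft()
        affected.append(host)
        queue.extend(children[host])
    return affected


# The example network, with a placeholder name for the Internet server.
parents = {
    "switch1": None,
    "switch2": "switch1",
    "proxy": "switch2",
    "gate": "proxy",
    "internet-server": "gate",
}
print(affected_hosts(parents, "switch2"))
# ['proxy', 'gate', 'internet-server']

# A made-up backbone router with 300 hosts behind it.
backbone = {"backbone-router": None}
backbone.update({f"host{i}": "backbone-router" for i in range(300)})
print(len(affected_hosts(backbone, "backbone-router")))  # 300
```

In the second case, one message about the router replaces 300 messages about hosts that are merely unreachable.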




Suggestion for the book

You might consider a quickstart guide in the book. Most people who purchase a book like this are interested in getting up and running, even in a minimal configuration, first... not memorizing a plethora of detail beforehand.

While I was manually going through the book, following it step by step to configure Nagios, the daemon complained because there were missing pieces, such as defining 24x7 "somewhere"; that is not clearly explained. Details like that can throw a new reader off very easily.

Quote: Although the

Quote: Although the check_interval parameter provides a way of forcing regular host checks, there is no real reason to do this.

This is not true. Example: Mail Server serving up IMAP on port 143 goes DOWN due to having the power go out. When the machine gets turned back on the IMAP service is not turned on by default (or insert whatever scenario that would make the IMAP service non-functional now, iptables, hosts.deny, etc.). Nagios continues to check for port 143 listening on this server and NOT whether the machine responds or not. This machine will continue to show as DOWN as long as the service is non-responsive.

There are only two fixes that I have found for this. 1. Turn on aggressive_host_checking, which will kill any machine with more than 1000 active service checks. 2. Use a host-checking mechanism as a service, preferably a quick single-packet ICMP check.

nice nagios tutorials

this is very easy installation and configuration for Nagios hope this will help more people installing nagios plugins and examples of how to use plugins