Chapter 4: Nagios Basics

Chapter 4 of the book Nagios: System and Network Monitoring by Wolfgang Barth. Reprinted by permission from No Starch Press and Open Source Press. Available at booksellers now. Full book details are at the bottom of the article.

4.2 Forced Host Checks vs. Periodic Reachability Tests

Service checks are carried out regularly by Nagios, host checks only when needed. Although the check_interval parameter provides a way of forcing regular host checks, there is no real reason to do this.  There is one reason not to do this, however: continual host checks have a considerable influence on the performance of Nagios.

If you nevertheless want to regularly check the reachability of a host, it is better to use a ping-based service check (see Section 6.2 from page 88).  At the same time you will obtain further information such as the response times or possible packet losses, which provides indirect clues about the network load or possible network problems.  A host check, on the other hand, also issues an OK even if many packets go missing and the network performance is catastrophic.  What is involved here--as the name "host check" implies--is only reachability in principle and not the quality of the connection.
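The difference between the two kinds of check can be sketched in a few lines of Python. This is an illustration only, not how Nagios or its plugins are implemented: the function names and the threshold values are assumptions, and the host check is simplified to "at least one reply arrived".

```python
# Illustrative sketch: function names and thresholds are assumptions,
# not part of Nagios or its plugins.

def host_check(packets_sent, packets_received):
    """Reachability in principle: any single reply counts as UP."""
    return "UP" if packets_received > 0 else "DOWN"


def ping_service_check(packets_sent, packets_received, avg_rtt_ms,
                       warn=(100.0, 20.0), crit=(500.0, 60.0)):
    """Judge connection quality from round-trip time (ms) and packet loss (%).

    warn and crit are (rtt_ms, loss_percent) pairs; reaching either
    value of a pair triggers that state.
    """
    loss_pct = 100.0 * (packets_sent - packets_received) / packets_sent
    if packets_received == 0 or avg_rtt_ms >= crit[0] or loss_pct >= crit[1]:
        return "CRITICAL", loss_pct
    if avg_rtt_ms >= warn[0] or loss_pct >= warn[1]:
        return "WARNING", loss_pct
    return "OK", loss_pct


# Seven of ten packets lost: the host check is satisfied, the service check is not.
print(host_check(10, 3))                  # UP
print(ping_service_check(10, 3, 40.0))    # ('CRITICAL', 70.0)
```

The service check returns the measured packet loss alongside the state, which is the kind of additional information that a pure reachability test does not provide.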

4.3 States of Hosts and Services

Nagios uses plugins for the host and service checks.  They provide four different return values (cf. Table 6.1 on page 85): 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN).

The return value UNKNOWN means that the plugin itself could not run properly, perhaps because it was called with wrong parameters.  When starting a plugin, you can normally specify the thresholds at which it issues a warning or a critical state.
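A minimal Python sketch of this behavior follows. It is not taken from any real plugin: check_value and main are hypothetical names, and the rule "larger values are worse" is an assumption made for the example.

```python
# Minimal plugin-style sketch; check_value and main are hypothetical names.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_value(value, warn, crit):
    """Map a measured value (larger is worse) to a return value."""
    if warn > crit:
        # Nonsensical thresholds: the plugin cannot give a meaningful answer.
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def main(argv):
    """Parse 'value warn crit' and return the code a plugin would exit with."""
    try:
        value, warn, crit = (float(arg) for arg in argv)
    except ValueError:
        # Wrong number of parameters or non-numeric input.
        return UNKNOWN
    return check_value(value, warn, crit)


print(main(["75", "80", "90"]))   # 0
print(main(["85", "80", "90"]))   # 1
print(main(["95", "80", "90"]))   # 2
print(main(["95", "high"]))       # 3
```

A real plugin would also print a one-line status message and pass the result to sys.exit() so that Nagios can read it as the process exit code.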

Nagios determines the states of services and hosts from the return values of the plugin.  The states for services are the same as the return values OK, WARNING, CRITICAL and UNKNOWN.  For the hosts the picture is slightly different: the UP state describes a reachable host, DOWN means that the computer is down, and UNREACHABLE refers to the state of nonreachability, where Nagios cannot test whether the host is available or not, because a parent is down (see Section 4.1, page 72).
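As a rough Python sketch, the mapping could look as follows. The function names are made up for this example, and treating every nonzero return value of a host check as a failure is a simplifying assumption, not a statement about the exact rules Nagios applies.

```python
# Simplified sketch; function names are hypothetical.
SERVICE_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def service_state(return_value):
    """Service states carry the same names as the plugin return values."""
    return SERVICE_STATES[return_value]


def host_state(return_value, parent_reachable=True):
    """Derive a host state from the plugin result and the parent's status.

    Assumption: any nonzero return value counts as a failed host check.
    """
    if return_value == 0:
        return "UP"
    # The check failed. If the parent is down, Nagios cannot tell whether
    # the host itself is available, so the host is UNREACHABLE, not DOWN.
    return "DOWN" if parent_reachable else "UNREACHABLE"


print(service_state(1))                        # WARNING
print(host_state(0))                           # UP
print(host_state(2))                           # DOWN
print(host_state(2, parent_reachable=False))   # UNREACHABLE
```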

In addition to this, Nagios makes a distinction between two types of state: soft state and hard state.  If a problem occurs for the first time (that is, if there was nothing wrong with the state of a service until now), the program initially categorizes the new state as a soft state and repeats the test several times.  The error may have been a one-off event that disappeared a short while later.  Only if the error persists after repeated tests does Nagios categorize it as a hard state.  Administrators are informed only of hard states, because messages about short-term disruptions that disappear again immediately afterwards only add to an unnecessary flood of information.

In our example the chronological sequence of states of a service can be illustrated quite simply.  A service with the following parameters is used for this purpose:

define service{
    host_name               proxy
    service_description     DNS
    ...
    normal_check_interval   5
    retry_check_interval    1
    max_check_attempts      5
    ...
}

normal_check_interval specifies at what interval Nagios should check the corresponding service as long as the state is OK or if a hard state exists--in this case, every five minutes.  retry_check_interval defines the interval between two service checks during a soft state--one minute in the example.  If a new error occurs, then Nagios will take a closer look at the service at shorter intervals.

max_check_attempts determines how often the service check is to be repeated after an error has first occurred.  If max_check_attempts has been reached and if the error state continues, Nagios inspects the service again at the intervals specified in normal_check_interval.

Figure 4.4 represents the chronological progression in graphic form: the illustration begins with an OK state (which is always a hard state).  Normally Nagios will repeat the service check at five-minute intervals.  After ten minutes an error occurs; the state changes to CRITICAL, but this is initially a soft state.  At this point in time, Nagios has not yet issued any message.

Now the system checks the service at the intervals specified in retry_check_interval, which here is every minute.  After a total of five checks (max_check_attempts) with the same result, the state changes from soft to hard.  Only now does Nagios inform the relevant people.  The tests are then repeated at the intervals specified in normal_check_interval.

Figure 4.4: Example of the chronological progression of states in a monitored service

In the next test the service is again available; thus its state changes from CRITICAL to OK.  Since an OK state is always a hard state, this change is not subject to any tests by Nagios at shorter intervals.
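The sequence described for Figure 4.4 can be replayed with a short Python simulation. This is a sketch under stated assumptions, not the Nagios scheduler: simulate is a hypothetical function, every non-OK result is treated as a continuation of the same problem, and an OK result is always counted as a hard state, as in the text.

```python
# Sketch of the check schedule; simulate() is a hypothetical helper.

def simulate(results, normal_check_interval=5, retry_check_interval=1,
             max_check_attempts=5):
    """Return a list of (minute, state, state_type) for a series of results."""
    timeline = []
    minute = 0
    failures = 0  # consecutive non-OK results, capped at max_check_attempts
    for state in results:
        if state == "OK":
            failures = 0
            state_type = "HARD"
        else:
            failures = min(failures + 1, max_check_attempts)
            state_type = "HARD" if failures == max_check_attempts else "SOFT"
        timeline.append((minute, state, state_type))
        # Soft states are rechecked quickly, everything else at the normal pace.
        if state_type == "SOFT":
            minute += retry_check_interval
        else:
            minute += normal_check_interval
    return timeline


results = ["OK", "OK"] + ["CRITICAL"] * 5 + ["OK"]
for entry in simulate(results):
    print(entry)
```

With the parameters from the example, the error first appears at minute 10 as a soft state, is rechecked at minutes 11, 12, and 13, becomes a hard state with the fifth check at minute 14, and the recovery is seen at minute 19.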

The transition of the service to the OK state after an error in the hard state is referred to as a hard recovery.  The system informs the administrators of this (if it is configured to do so) as well as of the change between various error-connected hard states (such as from WARNING to UNKNOWN).  If the service recovers from an error soft state to the normal state (OK)--also called a soft recovery--the administrators will, however, not be notified.
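These notification rules can be condensed into a small Python sketch. It assumes a timeline of (minute, state, state_type) tuples and a hypothetical function notifications; repeated reminders for a problem that persists, and any per-contact configuration, are left out.

```python
# Sketch of the notification rules; notifications() is a hypothetical helper.

def notifications(timeline):
    """Return (minute, kind, state) for every change between hard states."""
    last_hard_state = "OK"
    sent = []
    for minute, state, state_type in timeline:
        # Soft states never notify. A soft recovery returns to OK while the
        # last hard state is still OK, so it is not a change either.
        if state_type == "HARD" and state != last_hard_state:
            kind = "RECOVERY" if state == "OK" else "PROBLEM"
            sent.append((minute, kind, state))
            last_hard_state = state
    return sent


hard_failure = [
    (10, "CRITICAL", "SOFT"), (11, "CRITICAL", "SOFT"),
    (12, "CRITICAL", "SOFT"), (13, "CRITICAL", "SOFT"),
    (14, "CRITICAL", "HARD"), (19, "OK", "HARD"),
]
soft_recovery = [(10, "CRITICAL", "SOFT"), (11, "OK", "HARD")]

print(notifications(hard_failure))    # [(14, 'PROBLEM', 'CRITICAL'), (19, 'RECOVERY', 'OK')]
print(notifications(soft_recovery))   # []
```

Tracking only the last hard state is enough to cover all three cases from the text: a new hard error, a hard recovery, and a change between two hard error states.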

Even though the notification system ignores soft states and soft recoveries, Nagios still records them in the Web interface and in the log files.  In the Web front end, soft states can be identified by the fact that a value such as 2/5 is listed in the column Attempts.  This means that max_check_attempts expects five attempts, but only two have been carried out so far.  With a hard state, the value of max_check_attempts is listed twice at the corresponding position, which in the example is therefore 5/5.

For the administrator, the duration of the error state, shown in the column Duration of the Web interface, is more important than whether the state is still "soft" or already "hard".  It allows a better judgment of how large the overall problem may be.

For services that are not available because the host is down, the entry 1/5 in the column Attempts would appear, since Nagios does not repeat service checks until the entire host is reachable again.  The failure of a computer can be more easily recognized by its color in the Web interface: the service overview in Figure 4.3 (page 66) marks the failed host in red; if the computer is reachable, the background remains gray.

______________________

Comments

Suggestion for the book

You might consider a quickstart guide in the book. Most people who purchase a book like this are interested in getting up and running, even in a minimal configuration, first... not memorizing a plethora of detail beforehand.

While working through the book step by step to configure Nagios, the daemon complained about missing pieces, such as 24x7 having to be defined "somewhere"; that is not clearly explained. Details like that can throw a new reader off very easily.

Quote: Although the

Quote: Although the check_interval parameter provides a way of forcing regular host checks, there is no real reason to do this.

This is not true. Example: Mail Server serving up IMAP on port 143 goes DOWN due to having the power go out. When the machine gets turned back on the IMAP service is not turned on by default (or insert whatever scenario that would make the IMAP service non-functional now, iptables, hosts.deny, etc.). Nagios continues to check for port 143 listening on this server and NOT whether the machine responds or not. This machine will continue to show as DOWN as long as the service is non-responsive.

There are only two fixes that I have found for this. 1: Turn on aggressive_host_checking which will kill any machine with more than 1000 active service checks. 2. Use a host checking mechanism as a service. Preferably a quick one icmp packet check.

nice nagios tutorials

this is very easy installation and configuration for Nagios hope this will help more people installing nagios plugins and examples of how to use plugins
