Linux's Tell-Tale Heart, Part 4

The Numbers Game and Your Web Server Logs

Yes, it's time for another installment of the SysAdmin's Corner, home of all things Linux. Okay, not all things, just those that are in some way technical and in some way related to running, maintaining and otherwise administering your Linux system. (That still sounds like a fair bit.) Today, we dive once more into the heart and soul of your Linux system to find out just what the heck it is trying to tell us. Yes, this is the wonderful world of logs and log files.

After last week's article on PortSentry, I received several e-mails asking how one can identify a potential cracker's ISP based on what is often a numerical address. Before we leap into today's column, I'd like to take a moment to offer some suggestions on this. One way to find out just where a cracker is coming from is to do a traceroute on the address. The format is simple:

     traceroute -i ppp0 xxx.xxx.xxx.xxx

The -i flag specifies the interface through which the traceroute occurs. If you do not have multiple interfaces, meaning you are one PC on a masqueraded network (or something like that), then you can skip the -i flag. A dial-up connection would follow the above format. While traceroute output can be quite interesting and will sometimes yield the identity of the ISP (somewhere near the end of the chain), in most cases you'll find yourself scratching your head. If the cracker's IP address does not resolve with a simple host XXX.XXX.XXX.XXX command, I then search for who owns the address space of that IP address by visiting the ARIN (American Registry for Internet Numbers) whois database at http://www.arin.net/whois/index.html.

Entering an IP address into the search form will return the block of IP addresses that correspond to the one you just entered, including the owner of that block. You can then click on the owner and find the domain record for that person or organization, along with contact information and so on. Even armed with the technical contact's name, I usually also send mail to abuse@thedomain.com (where "thedomain" is, oddly enough, the domain) as well as postmaster@thedomain.com. The odds are pretty good that you will get the right person.

You won't find everything on the ARIN database. For instance, military IPs aren't listed with ARIN. You will, however, be able to locate just about anything else; either there, or through one of the links to other whois databases.
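If you find yourself doing this lookup often, the two steps can be wrapped in a small shell function. What follows is only a sketch: the function names is_ipv4 and lookup_owner are my own inventions, and it assumes that the host and whois commands are installed and that ARIN's whois server (whois.arin.net) is the right registry for the address, which will not always be the case.

```shell
# Return success if $1 looks like a dotted-quad IPv4 address.
is_ipv4() {
    echo "$1" | awk -F. '
        NF != 4 { exit 1 }
        { for (i = 1; i <= 4; i++) if ($i !~ /^[0-9]+$/ || $i > 255) exit 1 }'
}

# Hypothetical helper: try a reverse DNS lookup first, then fall back
# to asking ARIN who owns the address block.
lookup_owner() {
    ip=$1
    is_ipv4 "$ip" || { echo "not an IPv4 address: $ip" >&2; return 1; }
    host "$ip" || whois -h whois.arin.net "$ip"
}

# Usage:
#   lookup_owner xxx.xxx.xxx.xxx
```

The whois output is long; the owner of the block and the contact details are in there, but you will still have to read through it yourself.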

Let's briefly turn our attention away from potential crackers and examine another set of voluminous logs. If you are running a web server, you might have noticed that your logs can be a bit busy. Every time someone requests a page, you get at least one entry in your web server's log files. Why "at least" one, rather than only one? In all likelihood, your pages aren't simple text documents. You may have a half-dozen graphics and a web counter in addition to the actual text. Each and every one of those will generate a log entry. Perhaps your web page is designed using frames; that means even more log entries.

Before you can examine this wealth of information, you need to know where to look. For a Red Hat RPM installation, take a look in /etc/httpd/logs. If, like me, you built your own Apache server from source, then you should probably be looking in your /usr/local/apache/logs directory. Either way, the file to look at is access_log. Another way that will yield the information you need, no matter what system you are running, is the command

     httpd -V

This will print out the configuration settings of your web server binary. The piece of information you want will look like this:

     -D DEFAULT_XFERLOG="/path_to/your/access_log"
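If you would rather not pick that line out by eye, a one-line filter will do it for you. This is a minimal sketch; the function name xferlog_path is made up for the occasion, and it assumes the output line has exactly the form shown above.

```shell
# Read `httpd -V` output on stdin and print the compiled-in access log path.
xferlog_path() {
    sed -n 's/^.*-D DEFAULT_XFERLOG="\(.*\)".*$/\1/p'
}

# Usage:
#   httpd -V | xferlog_path
```

Keep in mind that this is only the compiled-in default. If the path printed is a relative one, or if your configuration file names a different log, you will have to take that into account.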

As I mentioned a couple of paragraphs back, every request for a page, image or other file is recorded in that log, along with the address of the host that asked for it. That's a lot of information, but how can you tell which page is your most popular? What do visitors look for on your site? Here's an interesting thought ... if you could count the types of browsers that visit your site, you might make some changes in your page layout to accommodate the majority of visitors. Viewing those logs with commands such as cat, more or less will not make your job any easier.
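For a quick and rough answer to questions like these, the standard text tools can do some of the counting. The sketch below rests on a few assumptions: the function names top_pages and top_agents are mine, the page count assumes the Common Log Format (where the requested path is the seventh whitespace-separated field), and the browser count only works if your server writes the "combined" format, which adds the referrer and browser strings in quotes at the end of each line.

```shell
# Print the N most requested paths (default 10) from an access log on stdin.
# Assumes Common Log Format: the request path is the 7th field.
top_pages() {
    awk '{ print $7 }' | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Print the N most common browser identification strings (default 10).
# Assumes the "combined" format, where the User-Agent is the last quoted
# string on the line; lines without one are skipped.
top_agents() {
    awk -F'"' 'NF >= 6 { print $6 }' | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Usage:
#   top_pages 5 < /usr/local/apache/logs/access_log
#   top_agents 5 < /usr/local/apache/logs/access_log
```

That is fine for a one-off question, but it gets tedious quickly and it is a long way from a proper report.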

You need something like Analog, from Stephen Turner at the University of Cambridge Statistical Laboratory. The web page for Analog proudly proclaims it as "the most popular log file analyser in the world". While I am unable to confirm or deny this claim, I can provide the web site address so you can pick up your own copy. (For those of you who want to check those numbers out for yourselves, Stephen has also provided links to the research.)

     http://www.statslab.cam.ac.uk/~sret1/analog/

Analog can sift through literally thousands of web site log entries and generate an easy-to-read report in very little time. I took an old access_log file from my web site and let Analog do its thing. On my old 150MHz Pentium notebook, Analog processed 800,000 lines of log in 1 minute and 39 seconds. That's reasonably impressive. At the time I downloaded my copy, the latest version was 4.11. For those of you who run platforms other than Linux (maybe a Solaris or BSD web server), this program will compile on many different UNIX platforms.

You should extract Analog into the directory structure where you will eventually want it to live. Normally, you do your "make" and "make install", then delete the installation directory. Not so here. If I wanted Analog to live in /usr/local/analog-4.11, I could change to the /usr/local directory and perform the following steps:

     cd /usr/local
     tar -xzvf analog-4.11.tar.gz
     cd analog-4.11

Before you go ahead and type "make", you will want to edit the anlghead.h file. The most obvious changes are the following lines:

     #define HOSTNAME "[my organization]"
     #define HOSTURL "none"
     #define ANALOGDIR "/usr/local/analog-4.11/"
     #define LOGFILE "/usr/local/apache/logs/access_log"

The HOSTNAME variable refers to a banner that will be displayed at the top of your HTML report. This can be pretty much anything; if you have only one web server, your company or organization name is an obvious choice. If you have several web servers, you might want to identify the specific host. Next we have HOSTURL, which simply puts a link back to your home page (or whatever page you would like) on the report head. ANALOGDIR is the path to the Analog program and support files. We decided on this when we extracted the program. Finally, LOGFILE is the full path to your web server log file; in my case, the path is /usr/local/apache/logs/access_log. Now you are ready to make the program.

     make

Before running the program for the first time, you are going to edit one more thing. This one is in the installation directory for the program and is called analog.cfg. The examples directory contains variations of "cfg" files with ideas on how you might set yours up. Quite a number of parameters can be set in analog.cfg, which will allow you to customize the report generated to better suit your site. A (very small) sample of those parameters and their meanings is listed below.

   LANGUAGE ENGLISH  (My choice, although I did try FRENCH)
   HOST ON           (The hostname of their system)
   DOMAIN ON         (Country codes get listed)
   REFERRER ON       (What page did they come from to find you)
   SEARCHQUERY ON    (Search terms used to find you)
   BROWSER ON        (What kind of browsers were they using)
   OSREP ON          (What were they running, Linux, Windows, etc)

You'll find tons more if you check out the big.cfg file in the examples directory, since it contains pretty much anything you could possibly want to consider. I should point out that in order for Analog to report certain things, you will have to make sure your web server is actually collecting this information. For instance, my server collects IP addresses of visitors and does not do a DNS lookup for each connection. This is the default Apache configuration these days, and it is done for speed. Consequently, I would not get HOST or DOMAIN reports on my system. In case you are curious, that parameter is "HostnameLookups on" in your Apache configuration file. Beware the cost, though. The default is "off" for good reason.
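If you are not sure how your own server is set, you can check before wondering why a report comes up empty. Here is a small sketch; the function name hostname_lookups is hypothetical, the configuration path in the usage line is an assumption based on a source install under /usr/local/apache, and it only looks at the one file you feed it (it knows nothing about included files or per-directory settings).

```shell
# Read an Apache configuration file on stdin and print the last
# HostnameLookups value found, in lowercase. Prints "off" (the default
# mentioned above) when the directive does not appear. Commented-out
# lines are ignored because their first field starts with "#".
hostname_lookups() {
    awk 'tolower($1) == "hostnamelookups" { v = tolower($2) }
         END { print (v == "" ? "off" : v) }'
}

# Usage (path is an assumption; adjust for your installation):
#   hostname_lookups < /usr/local/apache/conf/httpd.conf
```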

Anyhow, I digress. To run the Analog program, type either the full path to the command or execute it from the installation directory.

     cd /usr/local/analog-4.11
     ./analog

Wait a short time, and you will have your report. You can then view it in your favorite web browser. If you are running this locally on your system, you can simply type the path name of the report into your browser's location bar. If you would like to see an example of an Analog report (before installing it), click the "see a sample report" link on the Analog home page.

Analog is cool. It's fast. It's free (although not GPL - read the included license for details).

Well, everyone, it's time again to wrap it up for another week. Next time around, I'm going to show you how to throw a little color, a little flash if you will, into the otherwise flat world of log files and system analysis. If USA Today-like color charts do it for you, then make sure to be here when next we reconvene at the SysAdmin's Corner. In the meantime, remember that your Linux system is talking to you. Are you listening?
