Work the Shell - Analyzing Log Files

Ever wondered what your Web server is doing, only to find that you don't have a stats or analytics package installed? In fact, analyzing log files is a perfect task for the Linux command line and, by extension, shell scripts too.

If you're running Apache, and you probably are, you've got a file called access_log on your server, probably in /var/log/httpd, /var/log/apache2 or some similar directory (the /etc/httpd tree holds Apache's configuration files, not its logs). Find it (you can use locate or find if needed, as shown below).
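
If you're not sure where your distribution keeps it, either of these commands should track it down (the /var/log starting point is just a common guess, so widen the search if it comes up empty):

$ locate access_log
$ find /var/log -name access_log 2>/dev/null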

First, let's see how many hits you've received—that is, how many individual files have been served up. Use the wc program to do this:

$ wc -l access_log
   83764 access_log

Interesting, but is that for an hour or a month? The way to find out is to look at the first and last lines of the access_log itself, easily done with head and tail:

$ head -1 access_log
140.192.64.26 - - [11/Jul/2006:16:00:59 -0600] "GET /favicon.ico HTTP/1.1" 404 36717 "-" "-"
$ tail -1 access_log
72.82.44.66 - - [11/Jul/2006:22:15:14 -0600] "GET /individual-entry-javascript.js HTTP/1.1" 200 2374 "http://www.askdavetaylor.com/sync_motorola_razr_v3c_with_windows_xp_via_bluetooth.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

These log file lines can be darn confusing, so don't panic if you look at that and become completely baffled. The good news is that it's not important to know what every field means. In fact, all we care about are two things: the date and time in square brackets, and the name of the individual file requested, which appears right after GET in the quoted request.
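
If you'd like to see just those two pieces, awk can pluck them out by position; in this log format, the timestamp opens field four and the filename is field seven. Running it against the first line shown above:

$ head -1 access_log | awk '{print $4, $7}'
[11/Jul/2006:16:00:59 /favicon.ico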

Here you can see that the first line in the access log is from 11 July at 16:00:59 and the last line is from 11 July at 22:15:14. That works out to a window of about six hours and 15 minutes, or 375 minutes. Divide the total number of hits by the elapsed time, and we're seeing roughly 223 hits per minute, or a pretty impressive traffic level of 3.7 hits per second.
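
If you'd rather not do the division in your head, bc is happy to help (the 375 here is the elapsed minutes we just worked out):

$ echo "scale=1; 83764 / 375" | bc
223.3
$ echo "scale=1; 83764 / 375 / 60" | bc
3.7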

The Most Popular Files Sent

The second common query is to figure out which files are requested most often, and that's something we can ascertain with a quick call to awk to split that field out of the log file lines, followed by a combination of sort and uniq with its ever-useful -c option.

Let's take this one step at a time.

If you go back to the log file line shown above, you'll find that it's the seventh field that contains that value, meaning we can extract it like this:

$ head access_log | awk '{print $7}'
/favicon.ico
/0-blog-pics/itunes-pc-advanced-importing-prefs.png
/0-blog-pics/itunes-pc-importing-song.png
/styles-site.css
/individual-entry-javascript.js
/motorola_razr_v3c_and_mac_os_x_transfer_pictures_and_wallpaper.html
/Graphics/header-paper2.jpg
/Graphics/pinstripebg.gif
/0-blog-pics/bluetooth-razr-configured.png
/0-blog-pics/itunes-pc-library-sting.png

When you have a long list of data like this, you can figure out the most popular individual occurrences by sorting everything, then using the uniq command to count how often each line occurs. Then use sort again, this time numerically and in reverse order, so the largest counts appear first.

Here's an intermediate result to help you see what's happening:

$ awk '{print $7}' access_log | sort | uniq -c | head
    535 /
     26 //favicon.ico
      6 //signup.cgi
      1 /0-blog-pics/MVP-Combo_picture.jpg
      2 /0-blog-pics/address-book-import.jpg
      4 /0-blog-pics/adwords-psp-bids.png
     28 /0-blog-pics/aim-congrats-account.png
     28 /0-blog-pics/aim-create-screen-name.png
     38 /0-blog-pics/aim-delete-screenname-mac.png
     29 /0-blog-pics/aim-forget-password.png

All that's left is to sort it by most popular and axe all but the top few matches:

$ awk '{print $7}' access_log | sort | uniq -c | sort -rn | head
6176 /favicon.ico
5807 /styles-site.css
5733 /Graphics/header-paper2.jpg
5655 /Graphics/pinstripebg.gif
5512 /individual-entry-javascript.js
5458 /Graphics/marker-tray.gif
5366 /Graphics/help-button.jpg
5363 /Graphics/digman.gif
5359 /Graphics/delicious.gif
5323 /0-blog-pics/starbucks-hot-coffee.jpg

The first thing you'll notice is that these are all graphics, not pages. That's not a surprise, because just like most Web sites, my own AskDaveTaylor.com has graphics shared across all pages, making the graphics more frequently requested than any given HTML page.

Fortunately, we can limit the results to HTML pages by simply using the grep program to filter the final output of the pipeline:

$ awk '{print $7}' access_log | sort | uniq -c | sort -rn | grep "\.html" | head
 446 /motorola_razr_v3c_and_mac_os_x_transfer_pictures_and_wallpaper.html
 355 /how_to_create_new_screen_names_on_aol_america_online.html
 346 /how_do_i_cancel_my_america_online_aol_account.html
 293 /pc_to_sony_psp_how_do_i_download_music.html
 206 /how_do_i_get_photos_and_music_onto_my_sony_psp.html
 198 /how_do_i_get_my_wireless_wep_password_for_my_sony_psp.html
 195 /cant_get_standalone_music_player_to_work_on_myspace.html
 172 /convert_wma_from_windows_media_player_into_mp3_files.html
 166 /sync_motorola_razr_v3c_with_windows_xp_via_bluetooth.html
 123 /how_do_i_create_a_new_screen_name_in_aol_america_online_90.html
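
Once you're happy with that pipeline, it's worth dropping it into a short script so you don't have to retype it. Here's a minimal sketch; the script name, the arguments and the default of ten results are my own choices, not anything the tools require:

#!/bin/sh
# toppages.sh -- list the most frequently requested HTML pages
# usage: toppages.sh [logfile] [count]

log="${1:-access_log}"     # default to access_log in the current directory
count="${2:-10}"           # how many pages to list

awk '{print $7}' "$log" | sort | uniq -c | sort -rn |
  grep "\.html" | head -n "$count"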

______________________

Dave Taylor has been hacking shell scripts for over thirty years. Really. He's the author of the popular "Wicked Cool Shell Scripts" and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.
