Web Click Stream Analysis Using Linux Clusters
A wealth of information is gathered by the web server (httpd) and stored in the log file. Unfortunately, many sites just archive or delete the logs. In this article we'll look at how to use logs to improve the performance and usability of a web site. We'll also look at how to load the logs into a database, and then query the database to see how people navigate the site and how site performance can be improved.
I chose a Linux clustered database as a proof point of Linux scalability and also to show that all the old Pentium machines in the office could be put to good use. The idea here is to make a pile of old hardware thrash a newer, faster machine. A database server that runs on a cluster addresses one aspect of clustered computing. Most of the Linux supercomputer clusters focus on numeric problems. The cluster in this article addresses the commercial requirements of data storage and manipulation. In a cluster all the computers work on the same query at the same time. The key question is: as you add more computers, how much faster do you get the answer? Inter-node communication, data distribution and load balancing are critical to cluster performance. To demonstrate the load balancing issue I used the Parallel Virtual Machine (PVM) software common to Beowulf clusters. I used the Informix XPS database for testing the cluster database scalability.
Let's tackle the following:
Understanding the web logs and loading them into the database
Using Structured Query Language (SQL) to answer questions about the web site
Measuring the database cluster performance as more hardware is added
The log files for the Apache web server are found in the /var/log/httpd directory (all file names and locations are based on Red Hat 6.2). You'll find a number of logs, the most current one being access_log. The default format of the file is:
192.168.1.142 - - [14/Nov/2000:16:27:21 -0500] "GET /time.html HTTP/1.1" 304 -
The first field is the IP address of the client. The first - is the RFC 1413 identity check field. It is normally empty because IdentityCheck is off by default, due to the overhead of the lookup. The second - is the user ID from any active .htaccess authentication. Enclosed in [] is the timestamp of the access. The request method, in this case a GET, was for the URL /time.html using HTTP/1.1. The 304 is the status code for the access (see http://www.w3.org/Protocols/rfc2616/rfc2616.txt for all the codes, plus more info on HTTP than you ever imagined). The final - is the number of bytes sent, none in this case: the 304 status code says the object has not been modified, so no bytes are sent. More information on the log file may be found in the Apache documents: httpd.apache.org/docs/mod/mod_log_config.html.
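As a rough illustration of the field layout described above, here is a minimal Python sketch that splits one default-format line into named fields. The pattern and the names parse_clf_line and CLF_PATTERN are assumptions made for this example; they are not part of Apache or of the Perl code mentioned later in the article.

```python
import re

# Pattern for one line of Apache's default access_log format.
# The group names are illustrative choices, not Apache terminology.
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)

def parse_clf_line(line):
    """Return a dict of fields for one access_log line, or None if it does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Apache logs "-" when no bytes were sent, as with a 304 response.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

sample = '192.168.1.142 - - [14/Nov/2000:16:27:21 -0500] "GET /time.html HTTP/1.1" 304 -'
record = parse_clf_line(sample)
print(record["host"], record["method"], record["url"], record["status"], record["size"])
# prints: 192.168.1.142 GET /time.html 304 0
```

Lines that do not fit the pattern (for example, malformed requests) return None here, so a caller can count or skip them rather than fail.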
The default web server log provides lots of information, but it can mask many individual browsers operating from within the same internet service provider (ISP). Cookies are the solution.
Consider the following three lines from an access_log file:
206.175.175.226 - - [15/Nov/2000:12:23:36 -0500] "GET / HTTP/1.0" 200 1233
206.175.175.226 - - [15/Nov/2000:12:24:15 -0500] "GET / HTTP/1.0" 200 1233
206.175.175.226 - - [15/Nov/2000:12:25:27 -0500] "GET / HTTP/1.0" 200 1233
It looks like one browser, identified by a single IP address, has accessed the /, or root index, page three times. If logging is enabled with the following:
# add user tracking
CookieTracking on
CustomLog logs/clickstream "%{cookie}n %r %t"
in the httpd.conf configuration file, the above hits tell a very different story. Look in the file called clickstream (as configured above) for the following log information:
206.175.175.226.1017974309016881 GET / HTTP/1.0 [15/Nov/2000:12:23:36 -0500]
206.175.175.226.1028974309055577 GET / HTTP/1.0 [15/Nov/2000:12:24:15 -0500]
206.175.175.226.1017974309127275 GET / HTTP/1.0 [15/Nov/2000:12:25:27 -0500]
There were three completely different browsers accessing the site. This tracking is accomplished using a unique identifier (non-persistent cookie) for each browser. The number following the IP address is a user-tracking cookie sent by the server to the browser. Each time the browser makes a request, the cookie is sent back to the server by the browser.
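To make the difference concrete, the following Python sketch counts distinct IP addresses and distinct cookies in clickstream lines of the form written by the CustomLog directive above. It is only a sketch: the function name count_visitors is hypothetical, and the code assumes each cookie has the shape "IP address, dot, number" seen in the sample lines.

```python
import re

# Matches lines written by CustomLog "%{cookie}n %r %t":
# the cookie, then the request line, then the bracketed timestamp.
CLICK_PATTERN = re.compile(r'^(?P<cookie>\S+) (?P<request>.*) \[(?P<time>[^\]]+)\]$')

def count_visitors(lines):
    """Return (distinct IP addresses, distinct cookies) seen in clickstream lines."""
    ips, cookies = set(), set()
    for line in lines:
        match = CLICK_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the format
        cookie = match.group("cookie")
        if cookie == "-":
            continue  # no cookie was recorded for this request
        cookies.add(cookie)
        # Assumption: the cookie is "<ip>.<number>", as in the sample lines.
        ips.add(cookie.rsplit(".", 1)[0])
    return len(ips), len(cookies)

sample = [
    "206.175.175.226.1017974309016881 GET / HTTP/1.0 [15/Nov/2000:12:23:36 -0500]",
    "206.175.175.226.1028974309055577 GET / HTTP/1.0 [15/Nov/2000:12:24:15 -0500]",
    "206.175.175.226.1017974309127275 GET / HTTP/1.0 [15/Nov/2000:12:25:27 -0500]",
]
ip_count, cookie_count = count_visitors(sample)
print(ip_count, "IP address,", cookie_count, "browsers")
# prints: 1 IP address, 3 browsers
```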
Since the default installation doesn't have user tracking enabled, the examples in this article don't use CookieTracking. You can jump in with your current logs and do this analysis. You will want to turn on CookieTracking, so all those clicks from "that large ISP" will become individual streams instead of one very busy user.
A note on privacy. A number of companies have been caught with their hand in the cookie jar. A quick internet search yields 124,000 pages that contain the string "privacy" that also contain the string "cookies". This seems to be a topic that has attracted some attention. Develop a privacy policy, post it to your site and live by it.
Now that we have a handle on the log file, we need to load it into a database to run queries against the data. The log file strings are not friendly to a database high-speed loader, so we will massage them into a delimited string. Dust off your favorite Practical Extraction and Report Language; the code you can download for this is in Perl. A quick disclaimer: my Perl code is not good Perl style. Its distinguishing features are that it works and that it has at least one comment line.
For the following input:
158.58.240.58 - - [04/Jul/2000:23:59:42 -0500] "GET / HTTP/1.1" 200 47507
203.127.32.40 - - [04/Jul/2000:23:59:25 -0500] "GET /jp/product/images/prod_top.gif HTTP/1.0" 304 -
203.127.32.40 - - [04/Jul/2000:23:59:24 -0500] "GET /jp/product/down.html HTTP/1.0" 200 12390
192.147.84.235 - - [04/Jul/2000:23:59:22 -0500] "GET /idn-secure/Visionary/WebPages/visframe.htm HTTP/1.1" 401 -
211.45.44.33 - - [04/Jul/2000:23:59:19 -0500] "GET /kr/images/top_banner.gif HTTP/1.1" 200 6719
211.45.44.33 - - [04/Jul/2000:23:59:24 -0500] "GET /kr/images/waytowin.gif HTTP/1.1" 200 71488
211.45.44.33 - - [04/Jul/2000:23:59:25 -0500] "GET /kr/train/new.gif HTTP/1.1" 200 416
The output from the Perl program looks like this:
158.58.240.58|-|-|2000-07-04 23:59:42||/|200|47507|-|1|
203.127.32.40|-|-|2000-07-04 23:59:25|jp|/jp/product/images/prod_top.gif|304||gif|4|
203.127.32.40|-|-|2000-07-04 23:59:24|jp|/jp/product/down.html|200|12390|html|3|
192.147.84.235|-|-|2000-07-04 23:59:22|idn|/idn-secure/Visionary/WebPages/visframe.htm|401||htm|4|
211.45.44.33|-|-|2000-07-04 23:59:19|kr|/kr/images/top_banner.gif|200|6719|gif|3|
211.45.44.33|-|-|2000-07-04 23:59:24|kr|/kr/images/waytowin.gif|200|71488|gif|3|
211.45.44.33|-|-|2000-07-04 23:59:25|kr|/kr/train/new.gif|200|416|gif|3|
The last two fields, added by the Perl script, are the object type, stripped from the end of the requested URL, and the link depth, the number of / characters in the URL.
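The downloadable Perl script is not reproduced here. As a stand-in, the following Python sketch performs a comparable transformation and reproduces the sample output shown above. Everything in it is an assumption inferred from that sample: the function name to_delimited, the rule that the fifth field is the leading word characters of the first path segment (so /idn-secure yields idn), the use of - for a URL with no file extension, and an empty field when no bytes were sent.

```python
import re

# One access_log line: host, ident, user, timestamp parts, URL, status, bytes.
CLF_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) '
    r'\[(\d{2})/(\w{3})/(\d{4}):(\d{2}:\d{2}:\d{2}) [^\]]*\] '
    r'"\S+ (\S+)[^"]*" (\d{3}) (\d+|-)$'
)
MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

def to_delimited(line):
    """Convert one access_log line to a |-delimited record, or None if unparseable."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    host, ident, user, day, month, year, clock, url, status, size = match.groups()
    if month not in MONTHS:
        return None
    timestamp = "%s-%02d-%s %s" % (year, MONTHS[month], day, clock)
    # Assumption: the section is the leading word characters of the first
    # path segment, which is why "/idn-secure/..." yields "idn".
    section_match = re.match(r'/(\w+)', url)
    section = section_match.group(1) if section_match else ""
    # Object type: the extension of the last path segment, or "-" if none.
    last_segment = url.rsplit("/", 1)[-1]
    object_type = last_segment.rsplit(".", 1)[1] if "." in last_segment else "-"
    # Link depth: the number of "/" characters in the URL.
    depth = url.count("/")
    size = "" if size == "-" else size  # no bytes sent becomes an empty field
    fields = [host, ident, user, timestamp, section, url,
              status, size, object_type, str(depth)]
    return "|".join(fields) + "|"

sample = ('203.127.32.40 - - [04/Jul/2000:23:59:25 -0500] '
          '"GET /jp/product/images/prod_top.gif HTTP/1.0" 304 -')
print(to_delimited(sample))
# prints: 203.127.32.40|-|-|2000-07-04 23:59:25|jp|/jp/product/images/prod_top.gif|304||gif|4|
```

This sketch does not treat URLs with query strings specially, so a request such as a CGI call with parameters would need extra handling before the object type is meaningful.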