Analyzing Data

Each line has the following components:

  • IP address from which the request was made.

  • Two fields (represented with - characters) having to do with authentication.

  • The timestamp.

  • The HTTP request, starting with the HTTP request method (usually GET or POST) and a URL.

  • The result code, in which 200 represents "OK".

  • The number of bytes transferred.

  • The referrer, meaning the URL that the user came from.

  • The way in which the browser identifies itself.
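
To make the list above concrete, here is a minimal sketch that pulls those components out of a single line with a regular expression. The sample line is hypothetical, not taken from a real log. The group names (ip, ident, user and so on) are labels chosen for this illustration, and the pattern assumes Apache's "combined" log layout. It is not a complete or definitive parser.

```python
import re

# Hypothetical sample line in Apache's "combined" layout (not from a real log).
SAMPLE_LINE = (
    '203.0.113.7 - - [30/Jan/2016:06:36:24 +0200] '
    '"GET /blog/ HTTP/1.1" 200 5120 '
    '"http://example.com/start" "Mozilla/5.0 (X11; Linux x86_64)"'
)

# One named group per component in the list above; the names are illustrative.
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

match = COMBINED_RE.match(SAMPLE_LINE)
fields = match.groupdict() if match else {}

# Print each component next to its label.
for name, value in fields.items():
    print(f'{name:10} {value}')
```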

This information might seem a bit primitive and limited, but you can use it to better understand many things about the visitors to your blog. Note that it doesn't include information that JavaScript-based analytics packages (for example, Google Analytics) can provide, such as sessions, browser details and cookies. Nevertheless, logfiles can provide you with some good basics.

Two of the first steps of any data science project are 1) importing the data and 2) cleaning the data. That's because any data source will have information that's not really useful or relevant for your purposes, which will throw off the statistics or add useless bloat to the data you're trying to import. Thus, here I'm going to try to read the Apache logfile into Python, removing those lines that are irrelevant. Of course, what is deemed to be "irrelevant" is somewhat subjective; I'll get to that in just a bit.

Let's start with a very simple parsing of the Apache logfile. One of the first things Python programmers learn is how to iterate over the lines of a file:


infile = 'short-access-log'
for line in open(infile):
    print(line)

The above will print the file, one line at a time. However, for this example, I'm not interested in printing it; rather, I'm interested in turning it into a CSV file. Moreover, I want to remove the lines that are less interesting or that provide spurious (junk) data.

In order to create a CSV file, I'm going to use the csv module that comes with Python. One advantage of this module is that it can take any separator; despite the name, I prefer to use tabs between my columns, because there's no chance of mixing up tabs with the data I'm passing.
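
As a small sketch of the separator option, the following writes two hypothetical rows with tabs between the columns. The rows and the in-memory buffer (used here instead of a real output file) are assumptions made for the illustration.

```python
import csv
import io

# Hypothetical rows; note the comma and spaces inside the last column.
rows = [
    ['203.0.113.7', '200', 'Mozilla/5.0 (X11; Linux x86_64) Gecko, like'],
    ['198.51.100.23', '404', 'curl/7.47.0'],
]

# An in-memory buffer stands in for a file opened for writing.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t')
writer.writerows(rows)

tsv_text = buffer.getvalue()
print(tsv_text)
```

Because the delimiter is a tab, the comma in the first row is just ordinary data and the writer has no need to quote that field.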

But, how do you get the data from the logfile into the csv module? A simple-minded way to deal with this would be to break the input string using the str.split method. The good news is that split will work, at least to some degree, but the bad news is that it'll parse things far less elegantly than you might like. The timestamp, the HTTP request and the browser's identification string all contain spaces, so split scatters each of them across several list elements that you then would have to stitch back together.
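
Here is a quick sketch of the problem, using a hypothetical log line (not taken from a real log). The line has nine logical fields, but splitting on whitespace yields more pieces than that.

```python
# Hypothetical sample line in Apache's "combined" layout.
sample_line = (
    '203.0.113.7 - - [30/Jan/2016:06:36:24 +0200] '
    '"GET /blog/ HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64)"'
)

# Split on runs of whitespace, as str.split does with no arguments.
pieces = sample_line.split()

print(len(pieces), 'pieces for 9 logical fields')
for index, piece in enumerate(pieces):
    print(index, piece)
```

The timestamp is cut in two at the space before the time zone, and the quoted request and user-agent strings are broken apart with their quotation marks still attached.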

The bottom line is that if you want to read from an Apache logfile, you'll need to figure out the logfile format and read it, probably using a regular expression. Or, if you're a bit smarter, you can use an existing library that already has implemented the regexp and logic. I searched on PyPI (the Python Package Index) and found clfparser, a package that knows how to parse Apache logfiles in what's known as the "common logfile format" used by a number of HTTP servers for many years. If the variable line contains one line from my Apache logfile, I can do the following:


from clfparser import CLFParser
infilename = 'short-access-log'
for line in open(infilename):
    print(CLFParser.logDict(line))

In this way, I have turned each line of my logfile into a Python dictionary, with each key-value pair in the dictionary referencing a different field from my logfile's row.

Now I can go back to the csv module and employ the DictWriter class that comes with it. DictWriter, as you probably can guess, allows you to output CSV based on a dictionary. All you need to do is declare the fields you want, which lets you set their order in the resulting CSV file or even leave some out. (To leave fields out, also pass extrasaction='ignore'; by default, DictWriter raises a ValueError when a dictionary contains a key that isn't in the field list.) Then you can iterate over your file and create the CSV.
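
The following minimal sketch shows DictWriter reordering and dropping fields. The dictionary and its keys (host, status, size, ident) are hypothetical stand-ins chosen for this illustration, not the keys that clfparser produces, and an in-memory buffer is used in place of a real file.

```python
import csv
import io

# Hypothetical parsed row; the keys are illustrative only.
row = {'host': '203.0.113.7', 'status': '200', 'size': '5120', 'ident': '-'}

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=['status', 'host'],  # output order; 'size' and 'ident' left out
    delimiter='\t',
    extrasaction='ignore',          # the default, 'raise', rejects extra keys
)
writer.writeheader()
writer.writerow(row)

dict_output = buffer.getvalue()
print(dict_output)
```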

Here's the code I came up with:


import csv
from clfparser import CLFParser

infilename = 'short-access-log'
outfilename = 'access.csv'

# newline='' lets the csv module control line endings itself
with open(outfilename, 'w', newline='') as outfile, open(infilename) as infile:
    fieldnames = ['Referer', 'Useragent', 'b', 'h', 'l', 'r', 's',
                  't', 'time', 'timezone', 'u']
    writer = csv.DictWriter(outfile, fieldnames=fieldnames,
                            delimiter='\t')
    writer.writeheader()

    for line in infile:
        writer.writerow(CLFParser.logDict(line))

Let's walk through this code, one piece at a time. It's not very complex, but it does pull together a number of packages and functionality that provide a great deal of power in a small space:

  • First, I import both the csv module and the CLFParser class from the clfparser module. I'm going to be using both of these modules in this program; the first will allow me to output CSV, and the second will let me read from the Apache logs.

  • I set the names of the input and output files here, both to clean up the following code a bit and to make it easier to reuse this code later.

  • I then use the with statement, which invokes what's known as a "context manager" in Python. The basic idea here is that I'm creating two file objects, one for reading (the logfile) and one for writing (the CSV file). When the with block ends, both files will be closed, ensuring that no data has been left behind or is still in a buffer.

  • Given that I'm going to be using the csv module's DictWriter, I need to indicate the order in which fields will be output. I do this in a list; this list allows me to remove or reorder fields, should I want to do so.

  • I then create the csv.DictWriter object, telling it that I want to write data to outfile, using the field names I just defined and using tab as a delimiter between fields.

  • I then write a header to the file; although this isn't crucial, I recommend that you do so for easier debugging later. Besides, all CSV parsers that I know of are able to handle such a thing without any issues.

  • Finally, I iterate over the rows of the access log, turning each line into a dictionary and then writing that dictionary to the CSV file. Indeed, you could argue that the final line there is the entire point of this program; everything up to that point is just a preface.
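
As a quick check that the header row and tab delimiter round-trip, here is a sketch that reads such output back with csv.DictReader. The text below is a hypothetical excerpt standing in for the output file. It shows only three of the columns (h, s and b, which presumably mirror Apache's %h, %s and %b format directives for host, status and bytes), and the values are made up.

```python
import csv
import io

# Hypothetical excerpt of tab-separated output, with a header row.
sample_output = (
    'h\ts\tb\r\n'
    '203.0.113.7\t200\t5120\r\n'
    '198.51.100.23\t404\t312\r\n'
)

# DictReader takes the column names from the header row.
reader = csv.DictReader(io.StringIO(sample_output, newline=''), delimiter='\t')
records = list(reader)

for record in records:
    print(record['h'], record['s'], record['b'])
```

With a real file, the StringIO object would be replaced by the file opened with newline=''.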

______________________