Analyzing Data

Cleaning the Data

You've now seen that you can convert data from another format into a CSV file, one of the most common formats used in data science. However, as I mentioned previously, you also need to clean the data; analyzing bogus data will give you bogus results.

So, what sort of data needs to be cleaned here?

One obvious candidate is to remove any request that didn't come from a real human. Perhaps you're interested in finding out what Web crawlers, such as those from Google and Yahoo, are up to. But it's more likely that you want to know what humans are doing, which means removing all of those robots.

Of course, this raises the question of how you can know whether a request is coming from a robot. As a human, you can examine the User-agent string and make an educated guess. But given that you're trying to remove all of the robots, and that new ones are constantly appearing, something automatic would be better.
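
To illustrate the manual approach, and why it doesn't scale, here's a minimal sketch of checking a User-agent string against a hand-maintained list of robot signatures. (The looks_like_robot function and the signatures in it are just illustrations; any such list would fall out of date almost immediately.)

ROBOT_SIGNATURES = ['Googlebot', 'Yahoo! Slurp', 'bingbot']

def looks_like_robot(useragent):
    # Naive check: does the User-agent contain a known robot signature?
    return any(signature in useragent for signature in ROBOT_SIGNATURES)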

There's no perfect answer to this, but for the purposes of this article, I decided to use another Python module from PyPI, albeit one that's a bit out of date: robot-detection. The idea is that you import this module and then use its is_robot function on the Useragent field. If the request came from a robot, is_robot returns True. Here's my revised code:

import csv
from clfparser import CLFParser
from collections import Counter
import robot_detection

infilename = 'medium-access-log.txt'
outfilename = 'access.csv'

# Tally how many requests each robot User-agent has made
robot_count = Counter()

with open(outfilename, 'w') as outfile, open(infilename) as infile:
    fieldnames = ['Referer', 'Useragent', 'b', 'h', 'l', 'r', 's',
                  't', 'time', 'timezone', 'u']
    writer = csv.DictWriter(outfile, fieldnames=fieldnames,
                            delimiter='\t')
    writer.writeheader()

    for line in infile:
        # Parse the logfile line into a dict of CLF fields
        d = CLFParser.logDict(line)

        # Count robots; write everything else to the CSV file
        if robot_detection.is_robot(d['Useragent']):
            robot_count[d['Useragent']] += 1
        else:
            writer.writerow(d)

The above code is mostly unchanged from the previous version; the two modifications are that I'm now using robot_detection to filter out the robots, and that I'm using Python's Counter class to keep track of how many requests each robot has made. This alone might be useful information to have, perhaps not now, but in the future. For example, from examining the most recent 100,000 requests to my blog, I found more than 1,000 requests from the "domain re-animator bot", something I hadn't even heard of before.
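
If you're curious which robots appear most often in your own logfile, Counter makes that easy to inspect. Here's a minimal sketch, meant to run after the loop above has finished; the output will, of course, depend on your logfile:

for useragent, count in robot_count.most_common(10):
    print('{0:6} {1}'.format(count, useragent))

# How many requests, in total, came from robots?
print(sum(robot_count.values()))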

Given that I'm currently concentrating on user data, filtering out these bot requests made my data both more reliable and a great deal smaller. Out of 100,000 records, only 27,000 were from actual humans.

Conclusion

The first step of any data-analysis project is to import and clean the data. Here, I have taken the data and put it into CSV format, filtering out some of the lines that are of less interest. But this is just the start of my analysis, not its end. Next month, I'll explain how you can import this data into Python's Pandas package and start to analyze the logfile in a number of different ways.
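
As a small taste of what's to come, here's a minimal sketch of loading the tab-delimited access.csv file created above into Pandas, assuming Pandas has been installed from PyPI:

import pandas as pd

# Load the tab-delimited CSV file produced by the cleaning script
df = pd.read_csv('access.csv', delimiter='\t')

# Quick sanity check: how many rows and columns did we import?
print(df.shape)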

Resources

Data science is a hot topic, and many people have been writing good books about it. I've most recently been reading and enjoying an early release of the Python Data Science Handbook by Jake VanderPlas, which contains great information on data science as well as on its use from within Python. Cathy O'Neil and Rachel Schutt's slightly older book, Doing Data Science, is also excellent, and it approaches problems from a different angle. Both are published by O'Reilly, and both are great reads.

To learn more about the Python tools used in data science, check out the sites for NumPy, SciPy, Pandas and IPython. There is a great deal to learn, so be prepared for a deep dive and lots of reading.

Python itself is available from https://www.python.org, and the PyPI package index, from which you can download all of the packages mentioned in this article, is at https://pypi.org.