Work the Shell - Analyzing Your Search Keywords
February 1st, 2007 by Dave Taylor in
Last month, we started exploring how you can use a shell script to extract and analyze the HTTP_REFERER values out of your Web server log and identify the most common terms and phrases that people used to find your pages. Sounds useful, doesn't it?
The problem is, the script is more nuanced than it initially seems. Last month, we wrapped up with the following shell script:
#!/bin/sh
ACCESSLOG="/var/logs/httpd.logs/access_log"
grep 'google.com/search' $ACCESSLOG | \
awk '{print $11}' | \
cut -d\? -f2 | cut -d\& -f1 | \
sed 's/+/ /g;s/%22/"/g;s/q=//' | \
sort | \
uniq -c | \
sort -rn | \
head -5When I run this, here's what I see:
$ sh google-searches.sh 94 hl=en 18 client=safari 6 client=firefox-a 4 sourceid=navclient 4 client=opera
That's weird, because it's not search terms, it's other variables that are included with search strings sent from sites like Google (hl=en says that you've constrained searches to English-language sites only, client=safari identifies the user's Web browser as Apple's Safari and so on).
The problem is revealed when we look at the first ten matches rather than only the first five:
$ sh google-searches.sh | head -10 94 hl=en 18 client=safari 6 client=firefox-a 4 sourceid=navclient 4 client=opera 3 wicked cool scripts 3 hl=zh-CN 2 num=100 2 hs=wNy 2 barbara nelson%2Bpurses
Ah, so we can see that there are two valid searches here, one for “wicked cool scripts” and one for “Barbara nelson%2Bpurses”. Not sure what the latter one's about, but it's useful and important to see. Fortunately, screening out the bogus matches is as simple as using grep to remove fields that include an equal sign: grep -v '='.
Rather than have that at the very end of the long pipe in the script, however, I'll place it immediately after the sed invocation to strip out false results as soon as possible in the pipeline to speed up the entire script. Now it looks like this:
grep 'google.com/search' $ACCESSLOG | \
awk '{print $11}' | \
cut -d\? -f2 | cut -d\& -f1 | \
sed 's/+/ /g;s/%22/"/g;s/q=//' | \
grep -v '=' | \
sort | \
uniq -c | \
sort -rnNotice that the sed statement itself strips out the name= part of the search (q=), so that it's not incorrectly matched in the new grep statement.
Now we have the results we want:
$ sh google-searches.sh | head -10 3 wicked cool scripts 2 barbara nelson%2Bpurses 1 wsj%20password 1 why did animal kingdom introduce expedition everest 1 what makes a great speaker%3F 1 university of phoenix center of writing excellence 1 ubuntu x problem 1 triboot osx ubuntu ydl 1 the best dvd players 1 symbol html heart
This site doesn't get a huge amount of traffic, so let's run the very same script against my far-busier AskDaveTaylor.com site. The results are more interesting:
$ sh google-searches.sh | head -10 5 standalone player 4 psp help 4 create a myspace 4 Documents and Settings" 4 %24NtUninstall 3 view myspace accounts that are set to private 3 i cant hear music on runescape 2 transfer files to psp 2 sync v3 motorola mac 2 running unix in windows xp
Much more interesting. Oh, and if you want to know how many searches you're exploring, it's another simple tweak to the script, an invocation of wc:
$ sh google-searches.sh | wc -l
501So out of 501 searches, the single-most common search is “standalone player”, which represents only five out of 500, or 1% of my search traffic.
One more step before we walk away from this script for the month: let's get rid of the strange characters that have been carried over from the original URL encoding of the user's Web browser. What am I talking about? The %24, the closing double quote in Documents and Settings and the %2B in the earlier search for purses.
You can figure out all the mappings and convert everything as appropriate, but I'm lazy at the end of the day and will instead simply find all %xx sequences and replace them with a single space.
This sounds hard, but it's a perfect job for sed because it allows you to do pattern matching and then replace the matched material with whatever else you desire. Here's how I'd do that:
sed 's/%[0-9a-fA-F][0-9a-fA-F]/ /g'
Let's look at this closely before you panic. A set separated by square brackets is a set in regular expression terminology, so [0-9] will match any of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 or 0. It turns out that URL encoding uses hexadecimal, so not only can the values be 0-9, but they can also be A, B, C, D, E and F, in upper- or lowercase letters—hence 0-9 and a-f and A-F in the pattern. The overall pattern form is % followed by any of these possible values, followed by any of these possible values again. Now you can see the full pattern.
Finally, before we beat this completely into the ground, note that the bigger structure here in the sed statement is s/old/new/g, which replaces old with new throughout the line, whether it occurs once or 15 times.
We're not quite done yet, however, because we also need to strip the stray double quotes. Again, this is easily added to the sed statement:
sed 's/%[0-9a-fA-F][0-9a-fA-F]/ /g;s/"//g'
Here's the final script:
grep 'google.com/search' $ACCESSLOG | \
awk '{print $11}' | \
cut -d\? -f2 | cut -d\& -f1 | \
sed 's/+/ /g;s/%22/"/g;s/q=//' | \
sed 's/%[0-9a-fA-F][0-9a-fA-F]/ /g;s/"//g' | \
grep -v '=' | sort | uniq -c | sort -rnAnd, the final results:
$ sh google-searches.sh | head -15 6 NtUninstall 5 standalone player 4 psp music 4 psp help 4 creat a myspace 4 Documents and Settings 3 view myspace accounts that are set to private 3 i cant hear music on runescape 2 transfer files to psp 2 sync v3 motorola mac 2 running unix in windows xp 2 rss feed reader shell 2 reinstall windows xp hp 2 psp transfer music 2 psp internet
Note that once we remove the stray material, things organize slightly differently (for example, here you can see that psp music is one of the top searches, but earlier we had different variations of psp music, and it didn't make it to a top search value).
Okay, enough torturing of the Apache log file. Let's wrap this up and we'll switch to something completely different in the next column! Suggestions? Please e-mail them to me!
Special Magazine Offer -- Free Gift with Subscription
Receive a free digital copy of Linux Journal's System Administration Special Edition as well as instant online access to current and past issues. CLICK HERE for offer
Linux Journal: delivering readers the advice and inspiration they need to get the most out of their Linux systems since 1994.
Subscribe now!
The Latest
Newsletter
Tech Tip Videos
- Nov-04-09
- Oct-29-09
- Oct-26-09
Recently Popular
From the Magazine
December 2009, #188
If last month's Infrastrucuture issue was too "big" for you then try on this month's Embedded issue. Find out how to use Player for programming mobile robots, build a humidity controller for your root cellar, find out how to reduce the boot time of your embedded system, and if you're new to embedded systems find out the basics that go into one. You can also read about the Beagle Board, the Mesh Potato and a spate of other interestingly named items. And along with our regular columns don't miss our new monthly column: Economy Size Geek.
Delicious
Digg
StumbleUpon
Reddit
Facebook








Very useful information. I
On October 20th, 2008 Ali (not verified) says:
Very useful information. I believe is similar to what is used for Google analytics and other similar scripts? I am not a programmer so it would be hard for me to implement it but I am willing to try if it will search more accurate data. Or is just the code used for all these scripts analyzed here?
Thanks
Ali
http://www.dentalhealthsite.com
Post new comment