Work the Shell - Analyzing Your Search Keywords

Screen the unwanted results out of your access log searches.

Last month, we started exploring how you can use a shell script to extract and analyze the HTTP_REFERER values out of your Web server log and identify the most common terms and phrases that people used to find your pages. Sounds useful, doesn't it?

The problem is, the script is more nuanced than it initially seems. Last month, we wrapped up with the following shell script:

#!/bin/sh

ACCESSLOG="/var/logs/httpd.logs/access_log"

grep 'google.com/search' $ACCESSLOG | \
  awk '{print $11}' | \
  cut -d\? -f2 | cut -d\& -f1 | \
  sed 's/+/ /g;s/%22/"/g;s/q=//' | \
  sort | \
  uniq -c | \
  sort -rn | \
  head -5

When I run this, here's what I see:

$ sh google-searches.sh
  94 hl=en
  18 client=safari
   6 client=firefox-a
   4 sourceid=navclient
   4 client=opera

That's weird, because it's not search terms, it's other variables that are included with search strings sent from sites like Google (hl=en says that you've constrained searches to English-language sites only, client=safari identifies the user's Web browser as Apple's Safari and so on).

Screening Out False Matches

The problem is revealed when we look at the first ten matches rather than only the first five:

$ sh google-searches.sh | head -10
  94 hl=en
  18 client=safari
   6 client=firefox-a
   4 sourceid=navclient
   4 client=opera
   3 wicked cool scripts
   3 hl=zh-CN
   2 num=100
   2 hs=wNy
   2 barbara nelson%2Bpurses

Ah, so we can see that there are two valid searches here, one for “wicked cool scripts” and one for “Barbara nelson%2Bpurses”. Not sure what the latter one's about, but it's useful and important to see. Fortunately, screening out the bogus matches is as simple as using grep to remove fields that include an equal sign: grep -v '='.

Rather than have that at the very end of the long pipe in the script, however, I'll place it immediately after the sed invocation to strip out false results as soon as possible in the pipeline to speed up the entire script. Now it looks like this:

grep 'google.com/search' $ACCESSLOG | \
  awk '{print $11}' | \
  cut -d\? -f2 | cut -d\& -f1 | \
  sed 's/+/ /g;s/%22/"/g;s/q=//' | \
  grep -v '=' | \
  sort | \
  uniq -c | \
  sort -rn

Notice that the sed statement itself strips out the name= part of the search (q=), so that it's not incorrectly matched in the new grep statement.

Now we have the results we want:

$ sh google-searches.sh | head -10
   3 wicked cool scripts
   2 barbara nelson%2Bpurses
   1 wsj%20password
   1 why did animal kingdom introduce expedition everest
   1 what makes a great speaker%3F
   1 university of phoenix center of writing excellence
   1 ubuntu x problem
   1 triboot osx ubuntu ydl
   1 the best dvd players
   1 symbol html heart

This site doesn't get a huge amount of traffic, so let's run the very same script against my far-busier AskDaveTaylor.com site. The results are more interesting:

$ sh google-searches.sh | head -10
   5 standalone player
   4 psp help
   4 create a myspace
   4 Documents and Settings"
   4 %24NtUninstall
   3 view myspace accounts that are set to private
   3 i cant hear music on runescape
   2 transfer files to psp
   2 sync v3 motorola mac
   2 running unix in windows xp

Much more interesting. Oh, and if you want to know how many searches you're exploring, it's another simple tweak to the script, an invocation of wc:

$ sh google-searches.sh | wc -l
     501

So out of 501 searches, the single-most common search is “standalone player”, which represents only five out of 500, or 1% of my search traffic.

Stripping Out Unwanted Characters

One more step before we walk away from this script for the month: let's get rid of the strange characters that have been carried over from the original URL encoding of the user's Web browser. What am I talking about? The %24, the closing double quote in Documents and Settings and the %2B in the earlier search for purses.

You can figure out all the mappings and convert everything as appropriate, but I'm lazy at the end of the day and will instead simply find all %xx sequences and replace them with a single space.

This sounds hard, but it's a perfect job for sed because it allows you to do pattern matching and then replace the matched material with whatever else you desire. Here's how I'd do that:

sed 's/%[0-9a-fA-F][0-9a-fA-F]/ /g'

Let's look at this closely before you panic. A set separated by square brackets is a set in regular expression terminology, so [0-9] will match any of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 or 0. It turns out that URL encoding uses hexadecimal, so not only can the values be 0-9, but they can also be A, B, C, D, E and F, in upper- or lowercase letters—hence 0-9 and a-f and A-F in the pattern. The overall pattern form is % followed by any of these possible values, followed by any of these possible values again. Now you can see the full pattern.

Finally, before we beat this completely into the ground, note that the bigger structure here in the sed statement is s/old/new/g, which replaces old with new throughout the line, whether it occurs once or 15 times.

We're not quite done yet, however, because we also need to strip the stray double quotes. Again, this is easily added to the sed statement:

sed 's/%[0-9a-fA-F][0-9a-fA-F]/ /g;s/"//g'

Here's the final script:

grep 'google.com/search' $ACCESSLOG | \
  awk '{print $11}' | \
  cut -d\? -f2 | cut -d\& -f1 | \
  sed 's/+/ /g;s/%22/"/g;s/q=//' | \
  sed 's/%[0-9a-fA-F][0-9a-fA-F]/ /g;s/"//g' | \
  grep -v '=' | sort | uniq -c | sort -rn

And, the final results:

$ sh google-searches.sh | head -15
   6  NtUninstall
   5 standalone player
   4 psp music
   4 psp help
   4 creat a myspace
   4 Documents and Settings
   3 view myspace accounts that are set to private
   3 i cant hear music on runescape
   2 transfer files to psp
   2 sync v3 motorola mac
   2 running unix in windows xp
   2 rss feed reader shell
   2 reinstall windows xp hp
   2 psp transfer music
   2 psp internet

Note that once we remove the stray material, things organize slightly differently (for example, here you can see that psp music is one of the top searches, but earlier we had different variations of psp music, and it didn't make it to a top search value).

Okay, enough torturing of the Apache log file. Let's wrap this up and we'll switch to something completely different in the next column! Suggestions? Please e-mail them to me!

Dave Taylor is a 26-year veteran of UNIX, creator of The Elm Mail System, and most recently author of both the best-selling Wicked Cool Shell Scripts and Teach Yourself Unix in 24 Hours, among his 16 technical books. His main Web site is at www.intuitive.com.

______________________

Dave Taylor has been hacking shell scripts for over thirty years. Really. He's the author of the popular "Wicked Cool Shell Scripts" and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.

Comments

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Very useful information. I

Ali's picture

Very useful information. I believe is similar to what is used for Google analytics and other similar scripts? I am not a programmer so it would be hard for me to implement it but I am willing to try if it will search more accurate data. Or is just the code used for all these scripts analyzed here?
Thanks
Ali
http://www.dentalhealthsite.com

White Paper
Linux Management with Red Hat Satellite: Measuring Business Impact and ROI

Linux has become a key foundation for supporting today's rapidly growing IT environments. Linux is being used to deploy business applications and databases, trading on its reputation as a low-cost operating environment. For many IT organizations, Linux is a mainstay for deploying Web servers and has evolved from handling basic file, print, and utility workloads to running mission-critical applications and databases, physically, virtually, and in the cloud. As Linux grows in importance in terms of value to the business, managing Linux environments to high standards of service quality — availability, security, and performance — becomes an essential requirement for business success.

Learn More

Sponsored by Red Hat

White Paper
Private PaaS for the Agile Enterprise

If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.

Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.

Learn More

Sponsored by ActiveState