Work the Shell - Simple Scripts to Sophisticated HTML Forms, Take II

Parsing HTML files.

We've been digging into the Yahoo Movies database for the past few months, as you'll recall, building a command called findmovie that will have the following usage:

USAGE: findmovie -g genre -k keywords -nrst title

However, we slammed into a wall at 100kph last month in the simplest of calculations: how many titles match a given combination of query elements?

For example, how many action films are there that have “death” in the title? That'd look like findmovie -g act death, but making that count actually work is tricky, because the Yahoo Movies database output is different depending on whether there are zero matches, less than a page of matches or more than a page of matches. Examples of each output are “Sorry, no matches were found”, “(All results shown)” and “< Prev | 1 - 20 of 143 | Next 20 >”, respectively.

Oh, and it gets worse. Sometimes when there's less than a full page of results, you'll see something like this: “< Prev | 1 - 3 of 3 | Next >” instead.

It's pretty much a huge pain in the booty, and even if you crack open the source, there's no handy spot that says “0” or “4” or “143”. So, that's what I want to focus on this month—parsing an HTML file to isolate and identify this particular data point.

Caching the Results

The first observation I have about identifying a solution is that we are going to need to cache (or save) the results, so we can parse them more than once to see what we find. This brings up the old shell scripting challenge of choosing a good, unique, temporary filename.

I'm old-school. I'm used to appending .$$ so the process ID becomes the basis of the temp filename, but there are better solutions on modern systems. Check out mktemp, which ships with both Linux and BSD-based systems. If it's not available, use man smartly: man -k temp | grep '(1' will list the section 1 commands your distro offers instead. Here's a typical use of mktemp:

appname=$(basename "$0")
TMPFILE=$(mktemp "/tmp/${appname}.XXXXXX") || exit 1

It looks pretty similar to the .$$ approach, but with that many X characters to fill in, mktemp uses the PID and/or random letters (implementations differ), making the temp filename far harder for a hacker to guess or anticipate. The version of this script I've been developing on my Mac OS X system had the following code snippet:


if [ $dump -eq 1 ] ; then
  exec /usr/bin/curl --silent "$baseurl${params}&p=$pattern"
else
  exec open -a safari "$baseurl${params}&p=$pattern"
fi

The problem here is that using exec to invoke a command replaces the shell script with the command in question, so nothing after the exec line ever runs and the script never gets a chance to parse the output. Instead, it's time to rewrite it:


if [ $dump -eq 1 ] ; then
  appname=$(basename "$0")
  TMPFILE=$(mktemp "/tmp/${appname}.XXXXXX") || exit 1
  # save the page source in the temp file for later parsing
  /usr/bin/curl --silent "$baseurl${params}&p=$pattern" \
     > "$TMPFILE"
else
  exec open -a safari "$baseurl${params}&p=$pattern"
fi

That looks good. If we're dumping the file source, it'll go to the temporary file for later analysis. If it's a request that is supposed to launch the search results in a browser, it still uses the Mac OS X open command.
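If you want to convince yourself that exec really had to go from the curl branch, here's a small sketch. It runs each test in a throwaway child shell (the sh -c calls and the variable names are mine, purely for illustration) so the demonstration doesn't replace the script that's running it:

```shell
#!/bin/sh
# Inside the child shell, exec swaps the shell for echo,
# so the second echo never gets a chance to run.
with_exec=$(sh -c 'exec echo "from exec"; echo "after exec"')

# Without exec, the shell survives the command and carries on.
without_exec=$(sh -c 'echo "no exec"; echo "after command"')

printf 'with exec:\n%s\n\n' "$with_exec"
printf 'without exec:\n%s\n' "$without_exec"
```

The first variable ends up holding only "from exec", which is exactly the fate any parsing code would have met after the original exec curl line.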

Parsing the Results

To figure out what's going on, we need to account for three different possibilities, each of which has a different “fingerprint” in the source file. Here's a rough template:

if [ ! -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  echo there are zero results for that search.
elif [ ! -z "$(grep -i "Next&nbsp;&gt;" $TMPFILE)" ]
then
  echo got some results with case two.
else
  echo more than a page of results
fi 

Here, I'm showing only echo statements to give you a sense of the algorithm, but you can see that we're just testing for a known string that hopefully won't show up in other situations. Note the second test, though: Next&nbsp;&gt; is some HTML weirdness. “nbsp” is a non-breaking space, and “gt” is the > symbol. Wrap 'em in “&” and “;”, and you have HTML character entities.
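The same fingerprint test can be tried out without hitting Yahoo at all. The sketch below is a variation, not the script itself: classify_page is a hypothetical helper that reads a page on standard input, it swaps the [ ! -z "$(grep ...)" ] test for grep -q (which reports a match through its exit status), and the three sample lines are made up, modeled on the strings quoted earlier rather than copied from real pages:

```shell
#!/bin/sh
# classify_page (hypothetical helper): read a results page on stdin
# and report which of the three fingerprints it carries.
classify_page() {
  page=$(cat)
  if printf '%s\n' "$page" | grep -qi "no matches were found"; then
    echo zero
  elif printf '%s\n' "$page" | grep -qi "Next&nbsp;&gt;"; then
    echo single-page
  else
    echo multi-page
  fi
}

# Made-up fragments for illustration only.
printf '%s\n' 'Sorry, no matches were found' | classify_page
printf '%s\n' '<b>1 - 3</b>&nbsp;of&nbsp;<b>3</b>&nbsp;|&nbsp;Next&nbsp;&gt;' | classify_page
printf '%s\n' '<b>1 - 20</b>&nbsp;of&nbsp;<b>143</b>&nbsp;|&nbsp;Next&nbsp;20&nbsp;&gt;' | classify_page
```

The "&" and ";" characters need no escaping in the grep pattern, because neither is special in a basic regular expression; the double quotes keep the shell from treating them as control operators.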

Ascertaining the total match count requires yet more parsing of the output. Search for “death race”, and you'll find three matches, with the count ending up looking like this:


<b>3</b> 

Unfortunately, it's rather buried in a more complicated pattern, because here's a typical match:

<td align=right><font face=arial size="-2"><nobr>
↪&lt;&nbsp;Prev&nbsp;|&nbsp;<b>1 - 3</b>
↪&nbsp;of&nbsp;<b>3</b>&nbsp;... 

I have to admit, I was stumped for a bit, which is why having geeky friends like Martin and Lucretia M. Pruitt is so darn helpful. I posed this puzzle on Twitter (I'm @DaveTaylor if you want to follow me), and after some false starts, they suggested a simple and logical solution: turn the <b> and </b> into individual character delimiters, then simply use cut to pull out the field we seek. Smart!

Here's how that looks as a simple command sequence:


grep -i "1 - " $TMPFILE |
   sed 's/<b>/~/g;s/<\/b>/~/g' |
   cut -d\~ -f4 
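To watch the fields line up, here's the same pipeline as a self-contained sketch. It feeds in the sample markup from above, joined back into one line, instead of reading $TMPFILE; the variable names are mine, and the trailing "..." stands in for the rest of the line, just as it does in the sample:

```shell
#!/bin/sh
# The wrapped sample from the column, joined into a single line.
line='<td align=right><font face=arial size="-2"><nobr>&lt;&nbsp;Prev&nbsp;|&nbsp;<b>1 - 3</b>&nbsp;of&nbsp;<b>3</b>&nbsp;...'

# After sed, the line is split by "~" into these fields:
#   1: everything before the first <b>
#   2: 1 - 3
#   3: &nbsp;of&nbsp;
#   4: 3
count=$(printf '%s\n' "$line" |
  grep -i "1 - " |
  sed 's/<b>/~/g;s/<\/b>/~/g' |
  cut -d'~' -f4)

echo "$count"   # prints 3
```

The tilde works as a delimiter only because it doesn't otherwise appear in the line; any character that's absent from the markup would do the same job.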

Armed with this, the ugly HTML sequence above quickly reduces down to the value 3, which is exactly what we want. One nuance, though. It turns out that this data appears both before and after the matches, so we need to slip in | head -1 to ensure that we're parsing only one line and not duplicating the data entry or confusing the new parser. This means we can create the following code:


if [ ! -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  matches=0
elif [ ! -z "$(grep -i "Next&nbsp;&gt;" $TMPFILE)" ]
then
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
     sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
else
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
     sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
fi 

You can see how I'm differentiating the three cases and how the resultant code is identical in the second and third cases. In fact, they don't need to be separate cases, so the count is more easily calculated like this:


if [ ! -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  matches=0
else
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
     sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
fi 

If you initialize matches to zero, you actually can flip the logic of the first conditional and prune it down even further:


matches=0 
if [ -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
     sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
fi 

Nice. It's a simple, straightforward example of how, if you keep thinking about what you're really accomplishing with complex conditionals, they often can be not only simplified but sped up too.
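For anyone who wants to experiment offline, here's one way the pieces might fit together. Treat it as a sketch under assumptions: count_matches is a hypothetical wrapper of my own, the sample pages are made up, and two things go beyond the code above, namely a trap that removes the temp file on exit and a check that the extracted value really is a number (a page that says only “(All results shown)” has no “1 - ” line to parse):

```shell
#!/bin/sh
# count_matches (hypothetical wrapper): print the match count for the
# saved page named in $1, or return 1 if no numeric count is found.
count_matches() {
  if grep -qi "no matches were found" "$1"; then
    echo 0
    return 0
  fi
  n=$(grep -i "1 - " "$1" | head -n 1 |
    sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d'~' -f4)
  case $n in
    ''|*[!0-9]*) return 1 ;;   # empty or not all digits
    *) echo "$n" ;;
  esac
}

# Scratch file standing in for the cached curl output.
page=$(mktemp "/tmp/page.XXXXXX") || exit 1
trap 'rm -f "$page"' EXIT      # clean up when the script ends

# Made-up pages for illustration only.
printf '%s\n' 'Sorry, no matches were found' > "$page"
count_matches "$page"
printf '%s\n' '&lt;&nbsp;Prev&nbsp;|&nbsp;<b>1 - 20</b>&nbsp;of&nbsp;<b>143</b>&nbsp;|' > "$page"
count_matches "$page"
printf '%s\n' '(All results shown)' > "$page"
count_matches "$page" || echo "count not found"
```

Run as is, it prints 0, then 143, then "count not found". What to do in that last case is left open here, since the column's code doesn't address it.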

______________________

Dave Taylor has been hacking shell scripts for over thirty years. Really. He's the author of the popular "Wicked Cool Shell Scripts" and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.
