Work the Shell - Simple Scripts to Sophisticated HTML Forms, Take II
We've been digging into the Yahoo Movies database for the past few months, as you'll recall, building a command called findmovie that will have the following usage:
USAGE: findmovie -g genre -k keywords -nrst title
However, we slammed into a wall at 100kph last month in the simplest of calculations: how many titles match a given combination of query elements?
For example, how many action films are there that have “death” in the title? That'd look like findmovie -g act death, but making that count actually work is tricky, because the Yahoo Movies database output is different depending on whether there are zero matches, less than a page of matches or more than a page of matches. Examples of each output are “Sorry, no matches were found”, “(All results shown)” and “< Prev | 1 - 20 of 143 | Next 20 >”, respectively.
Oh, and it gets worse. Sometimes when there's less than a full page of results, you'll see something like this: “< Prev | 1 - 3 of 3 | Next >” instead.
It's pretty much a huge pain in the booty, and even if you crack open the source, there's no handy spot that says “0” or “4” or “143”. So, that's what I want to focus on this month—parsing an HTML file to isolate and identify this particular data point.
The first observation I have about identifying a solution is that we are going to need to cache (or save) the results, so we can parse it more than once to see what we find. This brings up the old shell scripting challenge of choosing a good, unique, temporary filename.
I'm old-school: my habit is to append .$$ so the process ID becomes the basis of the temp filename, but in fact, there are better solutions in modern Linux systems. Check out mktemp, which is also there if you're on a BSD-based system. If that's not available, use man smartly: man -k temp | grep '(1' will list the candidates your distro has instead. Here's a typical use of mktemp:
appname=$(basename "$0")
TMPFILE=$(mktemp "/tmp/${appname}.XXXXXX") || exit 1
It looks pretty similar, but mktemp replaces that run of X characters with an unpredictable suffix and creates the file for you, making the temp filename very hard for a hacker to guess or anticipate. The version of this script I've been developing on my Mac OS X system had the following code snippet:
if [ $dump -eq 1 ] ; then
  exec /usr/bin/curl --silent "$baseurl${params}\&p=$pattern"
else
  exec open -a safari "$baseurl${params}\&p=$pattern"
fi
The problem here is that using exec to invoke a command replaces the shell script with the command in question, which isn't going to work: nothing after the exec ever runs, so the script never gets a chance to parse the output. Instead, it's time to rewrite it:
if [ $dump -eq 1 ] ; then
  appname=$(basename "$0")
  TMPFILE=$(mktemp "/tmp/${appname}.XXXXXX") || exit 1
  /usr/bin/curl --silent "$baseurl${params}\&p=$pattern" \
    > "$TMPFILE"
else
  exec open -a safari "$baseurl${params}\&p=$pattern"
fi
That looks good. If we're dumping the file source, it'll go to the temporary file for later analysis. If it's a request that is supposed to launch the search results in a browser, it still uses the Mac OS X open command.
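One thing the snippet leaves open is cleanup. Here's a minimal sketch of how the temp file could be removed automatically when the script ends. The make_tmpfile helper and the fixed findmovie prefix are my own illustrative assumptions, not part of the script so far:

```shell
# Sketch only: make_tmpfile is a hypothetical helper, not part of findmovie.
make_tmpfile() {
    # mktemp replaces the trailing X characters with a unique suffix
    # and creates the file, failing if it cannot.
    mktemp "/tmp/$1.XXXXXX"
}

# A fixed prefix is assumed here; the article derives it from basename $0.
TMPFILE=$(make_tmpfile findmovie) || exit 1

# Remove the temp file when the shell exits.
trap 'rm -f "$TMPFILE"' EXIT

# Stand-in for the curl output that would normally be saved.
echo "page source goes here" > "$TMPFILE"
```

The trap runs on exit, so later parsing steps can read $TMPFILE freely without each one having to remember to delete it.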
To figure out what's going on, we need to account for three different possibilities, each of which has a different “fingerprint” in the source file. Here's a rough template:
if [ ! -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  echo there are zero results for that search.
elif [ ! -z "$(grep -i "Next&nbsp;&gt;" $TMPFILE)" ]
then
  echo got some results with case two.
else
  echo more than a page of results
fi
Here, I'm showing only output echo statements to give you a sense of the algorithm, but you can see that we're just testing for a known string that hopefully won't show up in other situations. Note the second test, though: Next&nbsp;&gt; is some HTML weirdness. “nbsp” is a non-breaking space, and “gt” is the > symbol. Wrap 'em in “&” and “;”, and you have HTML character entities.
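To make the entity business concrete, here's a small sketch showing that grep treats those entity names as ordinary text. The sample line is a made-up stand-in for the saved page source, and has_short_pager is a hypothetical helper name:

```shell
# Made-up one-line sample; the real page source will differ in detail.
sample='<nobr>&lt;&nbsp;Prev&nbsp;|&nbsp;<b>1 - 3</b>&nbsp;of&nbsp;<b>3</b>&nbsp;|&nbsp;Next&nbsp;&gt;</nobr>'

# Hypothetical helper: succeeds when stdin contains the "Next >" link
# exactly as it is written in the source, entities and all.
has_short_pager() {
    # -F matches a fixed string, -i ignores case, -q prints nothing.
    # Single quotes stop the shell from interpreting & and ; itself.
    grep -q -F -i 'Next&nbsp;&gt;'
}

if printf '%s\n' "$sample" | has_short_pager; then
    echo "less than a full page of results"
fi
```

The single quotes matter: unquoted, the shell would read & as "run in the background" and ; as a command separator long before grep ever saw the pattern.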
To ascertain the total match count requires yet more parsing of the output. Search for “death race”, and you'll find three matches, which end up looking like this:
<b>3</b>
Unfortunately, it's rather buried in a more complicated pattern, because here's a typical match:
<td align=right><font face=arial size="-2"><nobr>< Prev | <b>1 - 3</b> of <b>3</b> ...
I have to admit, I was stumped for a bit, which is why having geeky friends like Martin and Lucretia M. Pruitt is so darn helpful. I posed this puzzle on Twitter (I'm @DaveTaylor if you want to follow me), and after some false starts, they suggested a simple and logical solution: turn the <b> and </b> into individual character delimiters, then simply use cut to pull out the field we seek. Smart!
Here's how that looks as a simple command sequence:
grep -i "1 - " $TMPFILE | sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4
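If you want to watch the delimiter trick work in isolation, here's a sketch that feeds the sample match line through the same sed and cut steps. The extract_total name is just a hypothetical wrapper for illustration:

```shell
# The sample line mirrors the typical match shown above.
line='<td align=right><font face=arial size="-2"><nobr>< Prev | <b>1 - 3</b> of <b>3</b> ...'

# Hypothetical wrapper around the sed and cut steps.
extract_total() {
    # Turn every <b> and </b> into "~", then take the fourth
    # "~"-separated field, which is the text inside the second <b>...</b>.
    sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d'~' -f4
}

printf '%s\n' "$line" | extract_total    # prints 3
```

This relies on the line containing no tilde of its own before the count; if one ever shows up, any other character that can't appear in the page would do just as well as the delimiter.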
Armed with this, the ugly HTML sequence above quickly reduces down to the value 3, which is exactly what we want. One nuance, though. It turns out that this data appears both before and after the matches, so we need to slip in | head -1 to ensure that we're parsing only one line and not duplicating the data entry or confusing the new parser. This means we can create the following code:
if [ ! -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  matches=0
elif [ ! -z "$(grep -i "Next&nbsp;&gt;" $TMPFILE)" ]
then
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
    sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
else
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
    sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
fi
You can see how I'm differentiating the three cases and how the resultant code is fairly similar in the second and third cases. In fact, they don't need to be separate cases, so the count is more easily calculated like this:
if [ ! -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  matches=0
else
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
    sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
fi
If you initialized matches to zero, you actually can flip the logic of the first conditional and prune it down even further:
matches=0
if [ -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
    sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
fi
Nice. It's a simple, straightforward example of how, if you keep thinking about what you're really accomplishing with complex conditionals, they often can be not only simplified, but sped up too.
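For anyone who'd rather have this as a reusable piece, here's one way the same logic might be wrapped in a function. This is a sketch, not the script's actual code: count_matches is a hypothetical name, and I've swapped the [ -z "$(grep ...)" ] test for grep -q, which reports a match through its exit status alone:

```shell
# Sketch only: count_matches is a hypothetical name. It takes the path of
# the saved page source and prints the total number of matches.
count_matches() {
    # grep -q prints nothing; the if tests its exit status directly.
    if grep -q -i "no matches were found" "$1"; then
        echo 0
    else
        # Same pipeline as above; head -n 1 is the modern spelling of head -1.
        grep -i "1 - " "$1" | head -n 1 |
            sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d'~' -f4
    fi
}

# Typical use inside the script would be:
#   matches=$(count_matches "$TMPFILE")
```

Testing the exit status avoids capturing grep's output just to check whether it's empty, and quoting "$1" keeps the function safe if the temp filename ever contains a space.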
Dave Taylor has been hacking shell scripts for over thirty years. Really. He's the author of the popular "Wicked Cool Shell Scripts" and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.