Work the Shell - Simple Scripts to Sophisticated HTML Forms, Take II
We've been digging into the Yahoo Movies database for the past few months, as you'll recall, building a command called findmovie that will have the following usage:
USAGE: findmovie -g genre -k keywords -nrst title
However, we slammed into a wall at 100kph last month in the simplest of calculations: how many titles match a given combination of query elements?
For example, how many action films are there that have “death” in the title? That'd look like findmovie -g act death, but making that count actually work is tricky, because the Yahoo Movies database output is different depending on whether there are zero matches, less than a page of matches or more than a page of matches. Examples of each output are “Sorry, no matches were found”, “(All results shown)” and “< Prev | 1 - 20 of 143 | Next 20 >”, respectively.
Oh, and it gets worse. Sometimes when there's less than a full page of results, you'll see something like this: “< Prev | 1 - 3 of 3 | Next >” instead.
It's pretty much a huge pain in the booty, and even if you crack open the source, there's no handy spot that says “0” or “4” or “143”. So, that's what I want to focus on this month—parsing an HTML file to isolate and identify this particular data point.
The first observation I have about identifying a solution is that we are going to need to cache (or save) the results, so we can parse them more than once to see what we find. This brings up the old shell scripting challenge of choosing a good, unique, temporary filename.
I'm old-school. I'm used to appending .$$ so the process ID forms the basis of the temp filename, but in fact, there are better solutions in modern Linux systems. Check out mktemp, which ships with BSD-based systems (including Mac OS X) and, through GNU coreutils, with most Linux distributions. If that's not available, use man smartly: man -k temp | grep '(1' will list the replacement that your distro has instead. Here's a typical use of mktemp:
appname=$(basename "$0")
TMPFILE=$(mktemp "/tmp/${appname}.XXXXXX") || exit 1
It looks pretty similar to the .$$ approach, but mktemp replaces that run of X characters with a unique string (the PID and/or random letters, depending on the implementation) and creates the file itself, making the temp filename very hard for a hacker to guess or anticipate. The version of this script I've been developing on my Mac OS X system had the following code snippet:
if [ $dump -eq 1 ] ; then
  # inside double quotes, & needs no backslash
  exec /usr/bin/curl --silent "$baseurl${params}&p=$pattern"
else
  exec open -a safari "$baseurl${params}&p=$pattern"
fi
The problem here is that using exec to invoke a command replaces the shell script with the command in question, so nothing after that line ever runs, and we'd never get the chance to parse the results. Instead, it's time to rewrite it:
if [ $dump -eq 1 ] ; then
  appname=$(basename "$0")
  TMPFILE=$(mktemp "/tmp/${appname}.XXXXXX") || exit 1
  # save the page source so it can be parsed more than once
  /usr/bin/curl --silent "$baseurl${params}&p=$pattern" \
    > "$TMPFILE"
else
  exec open -a safari "$baseurl${params}&p=$pattern"
fi
That looks good. If we're dumping the file source, it'll go to the temporary file for later analysis. If it's a request that is supposed to launch the search results in a browser, it still uses the Mac OS X open command.
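Because the script now leaves a file behind in /tmp, it may be worth cleaning up after it. Here's a minimal sketch of one way to do that with trap. The hard-coded appname and the printf line are stand-ins of my own for $(basename "$0") and the curl call, so the snippet runs by itself:

```shell
#!/bin/sh
# Sketch: create a private temp file and remove it when the script exits.
# "findmovie" stands in for $(basename "$0"), and the printf below stands
# in for the curl call; both are assumptions made for this example.

appname="findmovie"

TMPFILE=$(mktemp "/tmp/${appname}.XXXXXX") || exit 1

# The EXIT trap fires when the script ends, whether normally or via exit.
trap 'rm -f "$TMPFILE"' EXIT

# Pretend this is the page source that curl saved.
printf '%s\n' "Sorry, no matches were found" > "$TMPFILE"

# The cached file can now be read as many times as needed.
grep -c "no matches were found" "$TMPFILE"
```

The trap line must come after the mktemp call, and the single quotes matter: they delay expansion of $TMPFILE until the trap actually runs.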
To figure out what's going on, we need to account for three different possibilities, each of which has a different “fingerprint” in the source file. Here's a rough template:
if [ ! -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  echo there are zero results for that search.
elif [ ! -z "$(grep -i "Next&nbsp;&gt;" $TMPFILE)" ]
then
  echo got some results with case two.
else
  echo more than a page of results
fi
Here, I'm showing only echo statements to give you a sense of the algorithm, but you can see that we're just testing for a known string that hopefully won't show up in other situations. Note the second test, though: the string it looks for in the raw source, Next&nbsp;&gt;, is some HTML weirdness. “nbsp” is a non-breaking space, and “gt” is the > symbol. Wrap 'em in “&” and “;”, and you have HTML character entities.
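The same fingerprint idea can be sketched as a small function that uses grep -q, which reports a match through its exit status and so avoids the [ ! -z "$(...)" ] wrapper. The function name, the labels it prints and the sample line are all made up for this example, and it assumes the raw page source spells “Next >” as Next&nbsp;&gt;, as described above:

```shell
#!/bin/sh
# classify_results FILE
# Prints "none", "single-page" or "multi-page" depending on which
# fingerprint string appears in the saved page source.
# The function name and labels are hypothetical, invented for this sketch.
classify_results()
{
  if grep -qi "no matches were found" "$1" ; then
    echo "none"
  elif grep -qi "All results shown" "$1" || grep -q "Next&nbsp;&gt;" "$1" ; then
    echo "single-page"
  else
    echo "multi-page"
  fi
}

# Try it on a made-up line resembling the short-result footer.
sample=$(mktemp /tmp/classify.XXXXXX) || exit 1
printf '%s\n' '&lt;&nbsp;Prev | <b>1 - 3</b> of <b>3</b> | Next&nbsp;&gt;' > "$sample"
classify_results "$sample"     # prints single-page
rm -f "$sample"
```

Note that the “more than a page” case is simply whatever is left over, so a page layout the script has never seen would land there too.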
Ascertaining the total match count requires yet more parsing of the output. Search for “death race”, and you'll find three matches, which end up looking like this:
<b>3</b>
Unfortunately, it's rather buried in a more complicated pattern, because here's a typical match:
<td align=right><font face=arial size="-2"><nobr>< Prev | <b>1 - 3</b> of <b>3</b> ...
I have to admit, I was stumped for a bit, which is why having geeky friends like Martin and Lucretia M. Pruitt is so darn helpful. I posed this puzzle on Twitter (I'm @DaveTaylor if you want to follow me), and after some false starts, they suggested a simple and logical solution: turn each <b> and </b> into a single-character delimiter, then simply use cut to pull out the field we seek. Smart!
Here's how that looks as a simple command sequence:
grep -i "1 - " $TMPFILE | sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4
Armed with this, the ugly HTML sequence above quickly reduces down to the value 3, which is exactly what we want. One nuance, though. It turns out that this data appears both before and after the matches, so we need to slip in | head -1 to ensure that we're parsing only one line and not duplicating the data entry or confusing the new parser. This means we can create the following code:
if [ ! -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  matches=0
elif [ ! -z "$(grep -i "Next&nbsp;&gt;" $TMPFILE)" ]
then
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
    sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
else
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
    sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
fi
You can see how I'm differentiating the three cases and how the resultant code is fairly similar in the second and third cases. In fact, they don't need to be separate cases, so the count is more easily calculated like this:
if [ ! -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  matches=0
else
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
    sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
fi
If you initialized matches to zero, you actually can flip the logic of the first conditional and prune it down even further:
matches=0
if [ -z "$(grep -i "no matches were found" $TMPFILE)" ]
then
  matches="$(grep -i "1 - " $TMPFILE | head -1 | \
    sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d\~ -f4)"
fi
Nice. It's a simple, straightforward and fine example of how if you keep thinking about what you're really accomplishing with complex conditionals, they often can be not only simplified, but sped up too.
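To see the delimiter trick working in isolation, here's a minimal sketch that wraps the final logic in a function and runs it against a hand-made sample page. The function name count_matches and the sample lines are inventions for this example (the real Yahoo Movies source is messier), and grep -q is swapped in for the -z test:

```shell
#!/bin/sh
# count_matches FILE
# Prints the total number of matches reported in a saved results page.
# The function name is hypothetical, invented for this sketch.
count_matches()
{
  if grep -qi "no matches were found" "$1" ; then
    echo 0
  else
    # Turn <b> and </b> into ~ so the count becomes field 4:
    #   ...Prev | ~1 - 3~ of ~3~ ...
    #   field 1  | f2   | f3 |f4
    grep -i "1 - " "$1" | head -1 | \
      sed 's/<b>/~/g;s/<\/b>/~/g' | cut -d'~' -f4
  fi
}

# Made-up sample page: the summary line shows up before and after the
# results, which is why head -1 is needed.
page=$(mktemp /tmp/count.XXXXXX) || exit 1
cat > "$page" <<'EOF'
<td align=right><nobr>&lt; Prev | <b>1 - 3</b> of <b>3</b> | Next &gt;
<p>Death Race</p>
<td align=right><nobr>&lt; Prev | <b>1 - 3</b> of <b>3</b> | Next &gt;
EOF

count_matches "$page"     # prints 3
rm -f "$page"
```

If a page has neither the “no matches” message nor a line containing “1 - ”, the function prints an empty line, so a caller would want to check for that before treating the result as a number.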
Dave Taylor has been hacking shell scripts for over thirty years. Really. He's the author of the popular "Wicked Cool Shell Scripts" and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.