Work the Shell - Simple Scripts to Sophisticated HTML Forms

by Dave Taylor

Last month, we looked at how to convert an HTML form on a page into a shell script with command flags and variables that let you have access to all the features of the search box. We tapped into Yahoo Movies and are building a script that offers up the key capabilities on the search form at movies.yahoo.com/mv/advsearch.

The script we built ended up with this usage statement:

USAGE: findmovie -g genre -k keywords -nrst title

So, that gives you an idea of what we're trying to do. Last month, we stopped with a script that offered the capabilities above and could open a Web browser with the result of the search using the open command.

Now, let's start with a caveat: open is a Mac OS X command-line utility that lets you launch a GUI app. Just about every other Linux/UNIX flavor has a similar feature if you're running the X Window System. In fact, with most of them, it's even easier. A typical Linux version of “open a Web browser with this URL loaded” might be as simple as:


firefox http://www.linuxjournal.com/ &

That's easily done, even in a shell script.

Actually, if you're going to end a script by invoking a specific command, the best way to do it is to “exec” the command, which replaces the shell that's running the script with the app you've specified, so the script isn't left running and doesn't even need to exit. In that case, the last line of the script might look like exec firefox "$url".
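
Here's a minimal sketch of that idea for a script that has to run on more than one system. The pick_opener function and its candidate list are assumptions for illustration, not part of findmovie, and exec_demo exists only to show that nothing after an exec ever runs:

```shell
#!/bin/sh
# pick_opener (hypothetical helper): print the first command from the
# argument list that is actually installed on this system.
pick_opener() {
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1    # none of the candidates were found
}

# exec_demo: exec replaces the child shell with echo, so the second
# echo is never reached.
exec_demo() {
  sh -c 'exec echo replaced; echo never-printed'
}
```

With that in place, the last line of a script could be something like opener=$(pick_opener xdg-open open firefox) && exec "$opener" "$url".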

This month, I want to go back and make our script do more interesting things. For now, an invocation like:

./findmovie.sh -g act evil

produces a command from the last few lines in the script:


echo "$baseurl${params}&p=$pattern"
exec open -a safari "$baseurl${params}&p=$pattern"

that ends up pushing out this:

http://movies.yahoo.com/mv/
↪search?yr=all&syn_match=all&adv=y&type=feature&gen=act&p=evil

It's pretty sophisticated!

Letting the User Dump the Resultant Data

What if the user wants the option of dumping the data to the command line instead of launching a browser? We can address that by adding a -d dump command flag into the getopts block:


while getopts "dg:k:nrst" arg
do
  case "$arg" in
    d ) dump=1 ;;
    g ) params="${params:+$params&}gen=$OPTARG" ;;
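
For reference, here's how the whole loop might look once it's closed off. This is a sketch, not the script's actual code: parse_flags is a hypothetical wrapper so the loop can be tried on its own, the k, n, r, s and t branches are left as placeholders, and treating the leftover arguments as the title pattern is an assumption based on the invocation shown earlier:

```shell
#!/bin/sh
# parse_flags (hypothetical wrapper): sets dump, params and pattern.
parse_flags() {
  dump=0          # default, so later numeric tests never see an empty value
  params=""
  OPTIND=1        # reset in case getopts already ran in this shell
  while getopts "dg:k:nrst" arg
  do
    case "$arg" in
      d ) dump=1 ;;
      g ) params="${params:+$params&}gen=$OPTARG" ;;
      k | n | r | s | t ) : ;;   # handled in the full script, omitted here
      * ) echo "USAGE: findmovie -d -g genre -k keywords -nrst title" >&2
          return 1 ;;
    esac
  done
  shift $((OPTIND - 1))
  pattern="$*"    # assumption: whatever remains is the title pattern
}
```

Calling parse_flags -d -g act evil leaves dump set to 1, params set to gen=act and pattern set to evil.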

To dump the data, we'll enlist the powerful curl command, as we've done in the past. The program has zillions of options, but as we're just interested in the raw output, we can ignore them all (fortunately) except for --silent, which hides status updates, leaving the conditional:


# ${dump:-0} keeps the test valid when -d wasn't given and dump is unset
if [ "${dump:-0}" -eq 1 ] ; then
  exec /usr/bin/curl --silent "$baseurl${params}&p=$pattern"
else
  exec open -a safari "$baseurl${params}&p=$pattern"
fi
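
A quick aside on quoting when assembling that URL: inside double quotes, the shell keeps a backslash that precedes an ampersand, so the ampersand needs no escaping there. The sketch below shows the difference. The way the URL is split between baseurl and params is an assumption based on the output shown earlier:

```shell
#!/bin/sh
# Assumed values, pieced together from the URL the script printed earlier.
baseurl="http://movies.yahoo.com/mv/search?"
params="yr=all&syn_match=all&adv=y&type=feature&gen=act"
pattern="evil"

# Inside double quotes, & is already literal, so no backslash is needed.
plain="$baseurl${params}&p=$pattern"

# Here the backslash survives and ends up inside the URL itself.
escaped="$baseurl${params}\&p=$pattern"

printf '%s\n' "$plain" "$escaped"
```

The second line of output carries a stray backslash just before &p=evil, which is not what you want to hand to curl or a browser.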

But, that generates a huge amount of data, including all the HTML needed to produce the page in question. Let's spend just a minute looking closely at that output and see if there's a way to trim things at least a bit.

It turns out that every movie title that's matched includes a link to the movie's information on the Yahoo Movies site. Those look like:


<a href="http://movies.yahoo.com/movie/1809697875/info">Resident Evil

So, that's easy to detect. Better, we can use a regular expression with grep and skip a lot of superfluous data too:

cmd | grep '/movie/.*info'

That comes close to having only the lines that match individual movies, but to take this one step further, let's remove the false matches for dvdinfo, because we're not interested in the links to DVD release info. That's a grep -v:

cmd | grep '/movie/.*info' | grep -v dvdinfo
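
If the DVD links follow the same /movie/ID/ shape as the info links, which is an assumption, a tighter pattern can do both jobs with one grep. Here's a sketch against a few sample lines. The dvdinfo URL in the sample is made up for illustration, and only_info_links is a hypothetical name:

```shell
#!/bin/sh
# only_info_links: keep lines that link to /movie/<digits>/info and drop
# everything else, including /movie/<digits>/dvdinfo.
only_info_links() {
  grep '/movie/[0-9]*/info'
}

# Sample input: the first link is the one shown above, the second is a
# hypothetical DVD link, and the third is unrelated markup.
sample='<a href="http://movies.yahoo.com/movie/1809697875/info">Resident Evil
<a href="http://movies.yahoo.com/movie/1809697875/dvdinfo">DVD info
<td>some other markup</td>'

printf '%s\n' "$sample" | only_info_links
```

Only the Resident Evil line should come out the other end.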

Now, let's have a quick peek at comedies that have the word “funny” in their titles:

./findmovie.sh -d -g com funny | grep '/movie/.*info' 
 ↪| grep -v dvdinfo |  head -3

<td><a href="http://movies.yahoo.com/movie/1810041785/info">
<b>Funny</b> People (2009)</a><br>

<td><a href="http://movies.yahoo.com/movie/1809406735/info">What's So 
 <b>Funny</b> About Me? (1997)</a><br>

<td><a href="http://movies.yahoo.com/movie/1808565885/info">That 
 <b>Funny</b> Feeling (1965)</a><br>

Okay, so the first three films in that jumble of HTML are Funny People, What's So Funny About Me? and That Funny Feeling.
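
Pulling those titles out of the markup takes one more filter. Here's a rough sketch that assumes each match arrives on a single line, as grep delivers it. The strip_tags function is a hypothetical helper, not part of the script:

```shell
#!/bin/sh
# strip_tags: delete anything that looks like an HTML tag, then trim
# leading and trailing whitespace. Crude, but enough for lines this simple.
strip_tags() {
  sed -e 's/<[^>]*>//g' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# One line of output from the search above, rejoined onto a single line.
line='<td><a href="http://movies.yahoo.com/movie/1810041785/info"><b>Funny</b> People (2009)</a><br>'

printf '%s\n' "$line" | strip_tags    # prints: Funny People (2009)
```

With the function defined in the calling shell, it can simply be tacked on to the end of the grep pipeline.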

From this point, you definitely can poke around and write some better filters to extract the specific information you want. The wrinkle? Like most other sites, Yahoo Movies chops the results into multiple pages, so what you'd really want to do is identify how many pages of results there are going to be and then grab the results from each, one by one. It's tedious, but doable.
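
To make that concrete, here's a sketch of the bookkeeping. It assumes 20 results per page and that the site accepts a b= parameter giving the number of the first result on a page, which is how its Next link appears to work. Both are guesses from the markup rather than documented behavior, and page_starts is a hypothetical helper:

```shell
#!/bin/sh
# page_starts TOTAL [PER_PAGE]: print the number of the first result on
# each page needed to cover TOTAL matches (PER_PAGE defaults to 20).
page_starts() {
  total=$1
  per_page=${2:-20}
  start=1
  while [ "$start" -le "$total" ]; do
    echo "$start"
    start=$((start + per_page))
  done
}

# Possible use, assuming the b= parameter behaves as guessed:
#   for start in $(page_starts 37); do
#     curl --silent "$baseurl${params}&p=$pattern&b=$start"
#   done
```

For 37 matches, page_starts 37 prints 1 and 21, one per line, so two fetches would cover everything.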

How Many Matches?

Let's look at a more interesting subset instead, by adding a -c flag that outputs just a count of how many films match the criteria you've given the command.

To do that, we don't need to go page by page, but just identify and extract the value from the match count on the page. For the comedies with “funny” in the title, the line on the page looks like this: “< Prev | 1 - 20 of 37 | Next 17 >”.

What we need to do is crack the HTML and look at the source to the link to “next 17” and see if it's extractable (is that a word?):

./findmovie.sh -d -g com funny | grep -i "next 17" | head -1

<td align=right><font face=arial size="-2"><nobr>
↪&lt;&nbsp;Prev&nbsp;|&nbsp;<b>1 - 20</b>
↪&nbsp;of&nbsp;<b>37</b>&nbsp;|&nbsp;<span
↪class="yperlink"><a href="/mv/search?p=funny&yr=all
↪&gen=com\&syn_match=all&adv=y&type=feature
↪&n=17&b=21&h=s">Next 17</a>&nbsp;&gt;
↪&nbsp;</nobr></span></span></font></td></tr>

Well, that's ugly. You'd think Yahoo didn't want to make this easy or something! It turns out, though, that this is a pretty tricky task, because if there are no matches, the link doesn't show up, and instead you see “Sorry, no matches were found”. If there are fewer than 20 matches, you see “Next >”, but it's not a clickable link, so it's not going to be so easy!
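
As a first brute-force stab, here's a sketch that covers the no-match case and the case with a visible total. It assumes the page still carries the “of N” fragment whenever there's at least one match, which is unverified for short result lists, and match_count is a hypothetical helper name:

```shell
#!/bin/sh
# match_count: read a results page on stdin and print the number of matches.
match_count() {
  html=$(cat)
  case "$html" in
    *"no matches were found"* ) echo 0; return 0 ;;
  esac
  # Assumption: the total appears as of&nbsp;<b>N</b>, as in the sample above.
  printf '%s\n' "$html" |
    sed -n 's/.*of&nbsp;<b>\([0-9][0-9]*\)<\/b>.*/\1/p' | head -1
}

# Possible use:
#   ./findmovie.sh -d -g com funny | match_count
```

If the fragment is missing, the function prints nothing, which at least makes the unhandled case easy to spot.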

Given that I'm out of space, let's defer this topic until next month. Meanwhile, look at the source to various searches yourself and see if anything comes to mind. Otherwise, it'll be brute force!

Dave Taylor has been hacking shell scripts for a really long time, 30 years. He's the author of the popular Wicked Cool Shell Scripts and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.
