Work the Shell - Converting HTML Forms into Complex Shell Variables

by Dave Taylor

I know, there are a million shell scripts waiting to be written to help administer your computer, run your server and fine-tune your back end, but I'm obsessed with scripts that interact with on-line data, so that's what I'm focusing on. My last column marked the end of our Twitterbot, a simple script that listens and responds to Twitter queries. You can try it by sending an “@” message from your Twitter account to @davesbot.

This month, I thought that given the issue's Entertainment theme, it'd be fun to dig into another facet of shell scripts that interact with the Web by looking at how to emulate a complex form. The form we'll emulate? Yahoo Movies advanced search.

Start by checking out Figure 1 (it shows the form). You can see it live by going to movies.yahoo.com/mv/advsearch too.

Figure 1. Yahoo Movies Advanced Title Search

We can crack open the HTML and read through the source, but I think it's more interesting to reverse-engineer it, because, like most search forms, this one uses the GET method and, therefore, exposes all of its parameters within the URL of the results page. For example, a search for the title “Strangelove”, without any other tweaks, produces the URL below. Normally, this URL would be all on one line, but I've separated the URL and the parameters onto multiple lines to make them a bit easier to see:


http://movies.yahoo.com/mv/search
      ?p=strangelove
      &yr=all
      &gen=all
      &syn=
      &syn_match=all
      &type=feature
      &adv=y

The search engine itself is at the URL shown in the first line of the listing above. The rest of the lines are parameters sent to the search engine. You can see that the search term is “p” (“p=strangelove”). You can infer the other parameters by looking at the form: yr = release decade, gen = genre, syn = synopsis keywords and so on.
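If you'd rather let the shell do the splitting than eyeball it, here's a minimal sketch of the idea. The split_query function is my own hypothetical helper, not part of the findmovie script, and it assumes the URL contains a single “?” that separates the address from the parameters:

```shell
#!/bin/sh
# Sketch: list the parameters carried in a GET-style search URL.
# The URL is the Strangelove example from the listing above, on one line.
url='http://movies.yahoo.com/mv/search?p=strangelove&yr=all&gen=all&syn=&syn_match=all&type=feature&adv=y'

# split_query (hypothetical helper): print one name=value pair per line.
split_query()
{
  # Keep everything after the first "?", then put each "&"-separated
  # pair on its own line.
  printf '%s\n' "$1" | cut -d '?' -f 2- | tr '&' '\n'
}

split_query "$url"
```

Run against the Strangelove URL, this prints the seven name=value pairs, one per line, starting with p=strangelove.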

Because there are so many possible values, however, we're going to have to look at the source after all. For example, those genres? Here's how Yahoo Movies breaks it down:

  • act = Action/Adventure

  • ada = Adaptation

  • ani = Animation

  • ... (lots of entries skipped for space)

  • tee = Teen

  • thr = Thriller

  • war = War

  • wes = Western

It's quite a list, really!

The question is, can we turn a form of this nature into a simple interactive shell script that will let users specify constraints on a search and pop open a Web browser with the resultant search? Of course we can!

Turning HTML into a Script

It would be cool to normalize the problem and come up with a general-purpose solution, some sort of parser that would take HTML form tags as input and produce shell script segments as output. Uh, no thanks.

Instead, with a few hacks in vi (yeah, I don't use Emacs), I have the following, as part of a usage() function:


usage()
{
cat << EOF
USAGE: findmovie -g genre -k keywords -nrst title
Where
   -n   only match those that have news or features
   -r   only match those with reviews
   -s   only match those that have showtimes
   -t   only match those that have trailers

and genre can be one of:
  act (Action/Adventure), ada (Adaptation), ani (Animation),
  ...
  tee (Teen), thr (Thriller), war (War) or wes (Western).
EOF

}

This makes life easy and pushes the trick of remembering the three-letter abbreviation for the genre onto the user. Sneaky, eh? Now, to be fair, good interface design would have me writing a more sophisticated script that lets users enter a variety of abbreviations (or the full word) and converts them into the proper Yahoo-approved abbreviation, but that's actually work, so we'll skip that too, okay?

Now, note the actual usage I've created:

USAGE: findmovie -g genre -k keywords -nrst title

This means there are a couple elements of the form that we are going to ignore in the script, including which decade the film was released and some of the more obscure conditional parameters. Still, it's enough to keep us busy.

Parsing Parameters with getopts

I've talked about the splendid getopts within shell scripts before, without which parsing the six parameters—two of which have arguments, four of which don't—would be a huge hassle. Instead, this is straightforward. Here are the first few lines to give you the idea:


while getopts "g:k:nrst" arg
do
  case "$arg" in
    g) params="${params:+$params&}gen=$OPTARG" ;;

There's a lot to talk about here, but we have covered getopts before, and you can <cough> check the man page too, right? In a nutshell though, a letter with a trailing colon means that option takes a required argument, so g and k have arguments (g:k:), while n, r, s and t do not (nrst).

The params expansion is a nifty little shell trick that's worth a special mention too. The notation ${params:+$params&} expands to the value of the $params variable, plus a trailing ampersand, if the variable already has a non-empty value. Otherwise, it's the null string. The point? To avoid a leading ampersand in the parameter string that we're building.
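To watch that expansion at work in isolation, here's a small sketch. The add_param function is a hypothetical helper I've made up for the demonstration; the real script does the same thing inline within the case statement:

```shell
#!/bin/sh
# Sketch: how the :+ expansion builds an ampersand-separated list.
# add_param is a hypothetical helper, not part of the article's script.
params=""

add_param()
{
  # If params is empty, the expansion yields nothing; otherwise it
  # yields the current value followed by "&".
  params="${params:+$params&}$1"
}

add_param "gen=war"
echo "$params"      # first value: no ampersand at all
add_param "syn=peace"
echo "$params"      # second value: one ampersand between the two
add_param "revs=1"
echo "$params"
```

The first call leaves params as just gen=war, and each later call adds exactly one ampersand between the old value and the new one.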

Let's have a quick peek:


$ findmovie.sh -g war -k peace -r
finished. params = gen=war&syn=peace&revs=1

As we'd hope, the params variable has been expanded to reflect the specific values that the user has specified on the command line—in this case, War films that have reviews and contain the word “peace” in the synopsis.
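For reference, a loop along the following lines would produce that output. Treat it as a sketch rather than the finished script: gen, syn and revs=1 are taken from the output above, but the parameter names Yahoo uses for -n, -s and -t aren't shown in this column, so the sketch only acknowledges those flags. Wrapping the loop in a parse_options function is my own assumption, made so it can be run on its own:

```shell
#!/bin/sh
# Sketch of the option loop. gen, syn and revs=1 come from the column's
# own output; the names for -n, -s and -t are not shown there, so those
# flags are accepted but add nothing to params.

parse_options()
{
  params=""
  OPTIND=1        # reset so the function can be called more than once
  while getopts "g:k:nrst" arg
  do
    case "$arg" in
      g) params="${params:+$params&}gen=$OPTARG" ;;
      k) params="${params:+$params&}syn=$OPTARG" ;;
      r) params="${params:+$params&}revs=1" ;;
      n|s|t) echo "-$arg: parameter name not shown here, skipping" >&2 ;;
      *) return 1 ;;   # getopts has already printed an error message
    esac
  done
  echo "finished. params = $params"
}

parse_options -g war -k peace -r
```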

Building the Full URL

There's a hiccup waiting to bite us with the code in its current state though. The problem is, what if the user specifies two words in the keywords value field or, worse, does so in the title field (remember, the last word or words are the title pattern, the core search for the Yahoo Movies system)?

The answer is that we need to convert spaces into symbols that are acceptable in a URL; in a query string, a space can be written as a plus sign. That's easily done, fortunately:

params="$(echo "$params" | sed 's/ /+/g')"

It's not the most elegant solution, but it's certainly functional!
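If you'd like the same trick in reusable form, here's a minimal sketch. The plus_encode name is hypothetical, and the function handles only spaces; a keyword containing “&” or “=” would still confuse the search engine, which I'm assuming is acceptable for a quick command-line tool:

```shell
#!/bin/sh
# Sketch: replace spaces with "+" without losing repeated spaces.
# plus_encode is a hypothetical helper name.
plus_encode()
{
  # Quoting "$1" keeps the value intact; printf avoids echo's
  # platform-dependent handling of backslashes and leading dashes.
  printf '%s\n' "$1" | sed 's/ /+/g'
}

params="gen=war&syn=war and peace"
params="$(plus_encode "$params")"
echo "$params"
```

With the value shown, params ends up as gen=war&syn=war+and+peace.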

The bigger problem here is that Yahoo requires certain parameters actually be present to do a search. Choose a genre on the Web interface and click search, and you'll see that's not sufficient for it to proceed.

As a result, our base URL for searches is going to be a bit more complicated:


baseurl="http://movies.yahoo.com/mv/search"
baseurl="${baseurl}?yr=all&syn_match=all&"

Try that, and you'll find it doesn't work. Why? Because there are some hidden parameters that Yahoo has slipped into the form that are required to send to the search program. Without them, it just stops.

In fact, here's the baseurl value we need:


baseurl="http://movies.yahoo.com/mv/search"
baseurl="${baseurl}?yr=all&syn_match=all&adv=y&type=feature&"

Now, how do we put this all together? It's not so easy, because we still need to grab whatever's on the end of the invocation (the title pattern), then mask the spaces:

shift $(( $OPTIND - 1 ))

Hang on, let me explain this line before we go further. OPTIND holds the index of the next positional parameter getopts would examine, which, once the loop finishes, is the first parameter that wasn't absorbed by the getopts processing. Positional parameters are numbered from 1, so shifting by OPTIND minus one discards exactly the options and their arguments and leaves the title words as $1 onward. The result? We can grab the title pattern with the $* notation:


params="$(echo "$params" | sed 's/ /+/g')"

pattern="$(echo "$*" | sed 's/ /+/g')"
echo "URL: $baseurl${params}&p=$pattern"
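To convince yourself that the shift really does leave only the title words behind, here's a stand-alone sketch. The title_after_options function is hypothetical and skips the option handling entirely, because only the OPTIND bookkeeping matters for this demonstration:

```shell
#!/bin/sh
# Sketch: what is left in "$*" after shifting away the parsed options.
# title_after_options is a hypothetical helper for illustration only.
title_after_options()
{
  OPTIND=1
  while getopts "g:k:nrst" arg
  do
    :             # option handling omitted; only OPTIND matters here
  done
  # OPTIND is the index of the first argument getopts did not consume,
  # so shifting by OPTIND - 1 makes that argument $1.
  shift $(( $OPTIND - 1 ))
  printf '%s\n' "$*" | sed 's/ /+/g'
}

title_after_options -g war -r the great escape
```

Here getopts consumes -g war and -r, OPTIND ends up at 4, and shifting by three leaves the three title words, which come out as the+great+escape.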

Now, finally, armed with that, we can search for films that contain the word “love” and have reviews:


$ findmovie.sh -r love

URL: ...BASEURL...revs=1&p=love

Type that in, and you'll find it works fine, showing 80 films that have “love” in the title and for which Yahoo Movies is aware of on-line reviews.

Most Linux distributions and other flavors of UNIX offer a way to launch a Web browser from the command line and point it at a specified URL. On Mac OS X, the open command does the job (many Linux desktops provide xdg-open for the same purpose). That's what we'll do:


echo "$baseurl${params}&p=$pattern"
open -a safari "$baseurl${params}&p=$pattern"
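If you want the script to run on more than one platform, one approach is to probe for whichever launcher is installed. This is only a sketch: first_available is a hypothetical helper, and the candidate list of xdg-open and open is my assumption about what your system might have. It prints the command it would use instead of launching anything:

```shell
#!/bin/sh
# Sketch: pick whichever URL opener exists on this system.
# first_available is a hypothetical helper; xdg-open (many Linux
# desktops) and open (Mac OS X) are the candidates assumed here.
first_available()
{
  for cmd in "$@"
  do
    if command -v "$cmd" >/dev/null 2>&1
    then
      echo "$cmd"
      return 0
    fi
  done
  return 1
}

if opener="$(first_available xdg-open open)"
then
  echo "would run: $opener URL"
else
  echo "no URL opener found; copy the URL into your browser instead"
fi
```

To launch the browser for real, replace the first echo with a call to "$opener" followed by the quoted URL.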

There are other things we can do now that we've converted the Yahoo advanced search form into a shell script, but we'll leave those for next month!

Dave Taylor has been hacking shell scripts for a really long time, 30 years. He's the author of the popular Wicked Cool Shell Scripts and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.
