Work the Shell - Converting HTML Forms into Complex Shell Variables

Web browser? We don't need no stinkin' Web browser for submitting HTML forms, that's what the shell is for.

I know, there are a million shell scripts waiting to be written to help administer your computer, run your server and fine-tune your back end, but I'm obsessed with scripts that interact with on-line data, so that's what I'm focusing on. My last column marked the end of our Twitterbot, a simple script that listens and responds to Twitter queries. You can try it by sending an “@” message from your Twitter account to @davesbot.

This month, I thought that given the issue's Entertainment theme, it'd be fun to dig into another facet of shell scripts that interact with the Web by looking at how to emulate a complex form. The form we'll emulate? Yahoo Movies advanced search.

Start by checking out Figure 1 (it shows the form). You can see it live by going to too.

Figure 1. Yahoo Movies Advanced Title Search

We can crack open the HTML and read through the source, but I think it's more interesting to reverse-engineer it, because, like most search forms, this one uses the GET method and, therefore, exposes all of its parameters within the URL of the results page. For example, a search for the title “Strangelove”, without any other tweaks, produces the URL below. Normally, this URL would be all on one line, but I've separated the URL and the parameters onto multiple lines to make them a bit easier to see:

The search engine itself is at the URL shown in the first line of the listing above. The rest of the lines are parameters sent to the search engine. You can see that the search term is “p” (“p=strangelove”). You can infer the other parameters by looking at the form: yr = release decade, gen = genre, syn = synopsis keywords and so on.

Because there are so many possible values, however, we're going to have to look at the source after all. For example, those genres? Here's how Yahoo Movies breaks it down:

  • act = Action/Adventure

  • ada = Adaptation

  • ani = Animation

  • ... (lots of entries skipped for space)

  • tee = Teen

  • thr = Thriller

  • war = War

  • wes = Western

It's quite a list, really!

The question is, can we turn a form of this nature into a simple interactive shell script that will let users specify constraints on a search and pop open a Web browser with the resultant search? Of course we can!

Turning HTML into a Script

It would be cool to normalize the problem and come up with a general-purpose solution, some sort of parser that would take HTML form tags as input and produce shell script segments as output. Uh, no thanks.

Instead, with a few hacks in vi (yeah, I don't use Emacs), I have the following, as part of a usage() function:

cat << EOF
USAGE: findmovie -g genre -k keywords -nrst title
   -n   only match those that have news or features
   -r   only match those with reviews
   -s   only match those that have showtimes
   -t   only match those that have trailers

and genre can be one of:
  act (Action/Adventure), ada (Adaptation), ani (Animation),
  tee (Teen), thr (Thriller), war (War) or wes (Western).
EOF


This makes life easy and pushes the trick of remembering the three-letter abbreviation for the genre onto the user. Sneaky, eh? Now, to be fair, good interface design would have me writing a more sophisticated script that lets users enter a variety of abbreviations (or the full word) and converts them into the proper Yahoo-approved abbreviation, but that's actually work, so we'll skip that too, okay?
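For the curious, here is a rough sketch of what that friendlier lookup might look like. It is not part of the findmovie script: the genre_code function name and the extra spellings it accepts are my own assumptions, and it covers only the genres listed in the usage message above.

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): map a genre name
# or abbreviation to the three-letter code from the usage message.
genre_code() {
  # Lowercase the input so "War" and "war" are treated alike.
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$name" in
    act|action|adventure|action/adventure) echo act ;;
    ada|adaptation)                        echo ada ;;
    ani|animation|animated)                echo ani ;;
    tee|teen)                              echo tee ;;
    thr|thriller)                          echo thr ;;
    war)                                   echo war ;;
    wes|western)                           echo wes ;;
    *) echo "unknown genre: $1" >&2; return 1 ;;
  esac
}

genre_code Western     # prints: wes
genre_code thr         # prints: thr
```

A case statement keeps the mapping readable and makes it easy to add the genres skipped here.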

Now, note the actual usage I've created:

USAGE: findmovie -g genre -k keywords -nrst title

This means there are a couple elements of the form that we are going to ignore in the script, including which decade the film was released and some of the more obscure conditional parameters. Still, it's enough to keep us busy.

Parsing Parameters with getopts

I've talked about the splendid getopts within shell scripts before, without which parsing the six parameters—two of which have arguments, four of which don't—would be a huge hassle. Instead, this is straightforward. Here are the first few lines to give you the idea:

while getopts "g:k:nrst" arg
do
  case "$arg" in
    g) params="${params:+$params&}gen=$OPTARG" ;;

There's a lot to talk about here, but we have covered getopts before, and you can <cough> check the man page too, right? In a nutshell though, a letter with a trailing colon means it has a required parameter, so g and k have arguments (g:k:), while n, r, s and t do not (nrst).
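To see how the whole loop might hang together, here is a minimal sketch wrapped in a function so it can be tried out on its own. The names gen and syn come from the form parameters described earlier, and revs is the name the script emits for -r. The names news, shows and trail, along with the parse_args wrapper itself, are placeholders I made up for illustration, so check them against the real form before relying on them.

```shell
#!/bin/sh
# Sketch of a complete option parser. gen, syn and revs are taken from
# the article; news, shows and trail are placeholder parameter names.
parse_args() {
  params=""
  OPTIND=1                      # reset so the function can be called again
  while getopts "g:k:nrst" arg
  do
    case "$arg" in
      g) params="${params:+$params&}gen=$OPTARG" ;;
      k) params="${params:+$params&}syn=$OPTARG" ;;
      n) params="${params:+$params&}news=1" ;;    # placeholder name
      r) params="${params:+$params&}revs=1" ;;
      s) params="${params:+$params&}shows=1" ;;   # placeholder name
      t) params="${params:+$params&}trail=1" ;;   # placeholder name
      *) echo "USAGE: findmovie -g genre -k keywords -nrst title" >&2
         return 1 ;;
    esac
  done
  shift $((OPTIND - 1))         # whatever remains is the title
  title="$*"
}

parse_args -g war -k peace -r strangelove
echo "params = $params, title = $title"
```

After the loop, OPTIND points at the first argument that is not an option, which is why the shift leaves only the title words behind.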

The params expansion is a nifty little shell trick that's worth a special mention too. The notation ${params:+$params } expands to the value of the $params variable, plus a trailing space, if the variable is set and not empty. Otherwise, it's the null string. In the script, the character appended is an ampersand rather than a space. The point? To avoid a leading ampersand in the parameter string that we're building.
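A tiny experiment shows the expansion at work, using the ampersand form from the script. The two parameter values are just sample data.

```shell
#!/bin/sh
# Demonstrate ${var:+word}: word is used only when var is set and non-empty.
params=""
params="${params:+$params&}gen=war"     # params was empty: nothing is prepended
echo "$params"                          # prints: gen=war

params="${params:+$params&}revs=1"      # params had a value: "gen=war&" is prepended
echo "$params"                          # prints: gen=war&revs=1
```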

Let's have a quick peek:

$ findmovie -g war -k peace -r
finished. params = gen=war&syn=peace&revs=1

As we'd hope, the params variable has been expanded to reflect the specific values that the user has specified on the command line—in this case, War films that have reviews and contain the word “peace” in the synopsis.
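From here, the remaining step is to glue the parameter string onto the search URL. The sketch below is only one way that might look: the baseurl value is a made-up placeholder rather than the real Yahoo Movies address, and build_url is a hypothetical helper name.

```shell
#!/bin/sh
# Placeholder endpoint (an assumption); substitute the real search URL.
baseurl="https://movies.example.com/search"

build_url() {
  # $1 = parameter string such as "gen=war&syn=peace&revs=1"
  if [ -n "$1" ]; then
    echo "$baseurl?$1"
  else
    echo "$baseurl"
  fi
}

url=$(build_url "gen=war&syn=peace&revs=1")
echo "$url"
# To hand the URL to a browser, something like xdg-open "$url" (Linux
# desktops) or open "$url" (macOS) could follow here.
```

Note that nothing here URL-encodes the values, so keywords containing spaces or ampersands would need extra handling.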


Dave Taylor has been hacking shell scripts for over thirty years. Really. He's the author of the popular "Wicked Cool Shell Scripts" and can be found on Twitter as @DaveTaylor and more generally at



A couple of points

ciotog

Nice work, but I have a couple of quibbles:

1. vi vs Emacs detracts from the article - it's largely irrelevant

2. You use params="${params:+$params&}..." to keep off the leading '&' if params hasn't been defined yet, but then you stick params to a string (baseurl) that has a trailing '&', so it's unnecessary. Why not just use params=$params&... and leave the '&' off the end of baseurl? Or stick $params to the end, like so:

The url will have an extra '&' at the end but that's valid.
I recognize that the point is to teach these kinds of things, but to add them when not necessary is generally bad style.

3. Aside from that, in the paragraph where you explain the ${:+} notation you use a space character to illustrate it but it's probably not the best choice. Better to use a more visible character.