Work the Shell - Of Movies, Trivia Games and Twitter


by Dave Taylor

on July 1, 2008

During the past few months, I have become an addict. In fact, I went from being a skeptic to being an evangelist, in a way that probably makes me a poster case for a 12-step program. What is this evil thing that's sucked up my brain and passion? It's not illegal; it's not something I have to hide from my children; but, yes, it's pretty geeky, and it's one of the fastest-growing services in the Web 2.0 universe: Twitter.

What I find most compelling about Twitter is that it's both popular and nascent, and as a result, you can see its best practices evolve before your eyes. Even in the few months I've been active with the service, it has gone from just personal updates (as in “Eating burger at McD's. Back to meetings in :30”) to more business uses and news dissemination (“Flash: Redbox hacked by card sniffers. See...”).

In a nutshell, Twitter lets you send very short messages to dozens, hundreds or even thousands of followers, and from a Linux/shell scripting perspective, it's very cool because the API lets you send messages easily with a single line of code. But, let's get there in a bit. First, we need something to transmit.

Movie Trivia? Sure!

Because I can't seem to shake my enthusiasm for writing games as shell scripts (speaking of psychological curiosities, that's another one for you), I thought it would be interesting to write a movie trivia game for Twitter. So, that's what we'll do.

The challenge is to figure out where the data will come from. I mean, I built up a huge database of word history trivia for etymologic.com, and my buddy Kevin Savetz and I wrote more than 500 computer trivia questions for trivial.net, and both were a huge amount of effort. Since creating those sites, I've become too lazy to repeat the effort, so the goal is to find a source of existing movie information that I can leverage or repurpose and that will lend itself to a trivia game.

For this effort, I'll use the Internet Movie Database (www.imdb.com), which has an extraordinary amount of interesting movie trivia deep in its database. One place to start is its random movie quote feature, at www.imdb.com/Games/randomquote.html, but truth be told, that trivia is so darn obscure, I've never been able to identify any of the quotes, and I'm quite a movie fanatic.

Let's make this more complicated instead, and start with the IMDb top 250 movies list and isolate the quotes and trivia from those movies. That list is at www.imdb.com/chart/top, and if you crack it open, you'll see that each movie is referenced with a URL of this form: http://www.imdb.com/title/tt0068646/.

This means a simple grep can pull out the URL of each and every one of the top 250 movies. Utilizing curl, here's everything you need:


curl -s http://www.imdb.com/chart/top | \
sed 's/</\
/g' | grep '/title/tt' | more

The output isn't quite what we want, but it's getting pretty close to a usable database with just this simple command, not even enough to justify a shell script:

a href="/title/tt0068646/">The Godfather
a href="/title/tt0111161/">The Shawshank Redemption
a href="/title/tt0071562/">The Godfather: Part II
a href="/title/tt0060196/">Buono, il brutto, il cattivo, Il
a href="/title/tt0110912/">Pulp Fiction
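If the sed line looks odd, here's what it's doing: it swaps every < for a newline so that each tag starts a line of its own, which is what lets grep pick out the title links one per line. Here's a minimal sketch of that idea, run against a made-up scrap of markup rather than the live page, with tr standing in for sed because it does the same swap and is easier to read. The sample string and the split_tags name are mine, purely for illustration:

```shell
#!/bin/sh
# Hypothetical sample of chart markup; the real page is far larger
# and its layout may differ.
sample='<td><a href="/title/tt0068646/">The Godfather</a></td><td><a href="/title/tt0111161/">The Shawshank Redemption</a></td>'

# Break the input at every "<" so each tag begins its own line,
# then keep only the lines that contain a title link.
split_tags() {
  tr '<' '\n' | grep '/title/tt'
}

printf '%s\n' "$sample" | split_tags
```

Run as is, this prints the two "a href=" lines for The Godfather and The Shawshank Redemption, in the same shape as the listing above.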

To strip out only what we need, because we really just want to have a file of 250 URLs of the top 250 movies, we merely need a tiny addition:


curl -s http://www.imdb.com/chart/top | \
sed 's/</\
/g' | grep '/title/tt' | cut -d\" -f2

And, here's the result:

/title/tt0068646/
/title/tt0111161/
/title/tt0071562/
/title/tt0060196/
/title/tt0110912/
...many, many lines skipped...
/title/tt0325980/
/title/tt0061809/
/title/tt0113247/

It's easy to drop this all into a data file, fixing the URLs as we go along so that they are fully qualified, with a simple additional call to sed like this:

| sed 's/^/http:\/\/www.imdb.com/'

Now we have a data file full of URLs, like this:

http://www.imdb.com/title/tt0068646/

Visit this URL, and you'll find that it's the #1 top movie on IMDb, the brilliant film The Godfather.
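Putting the pieces so far together, the whole extraction fits in one small function. This is only a sketch under a couple of assumptions: the function name extract_urls is mine, the one-line sample is invented so the thing runs offline, and the page is assumed to still be laid out the way it is described above.

```shell
#!/bin/sh
# extract_urls: read chart HTML on stdin, write one fully
# qualified movie URL per line on stdout.
extract_urls() {
  sed 's/</\
/g' |
    grep '/title/tt' |
    cut -d\" -f2 |
    sed 's|^|http://www.imdb.com|'
}

# Offline demonstration on a made-up fragment of markup.
printf '%s\n' '<a href="/title/tt0068646/">The Godfather</a>' | extract_urls
```

With network access, the live equivalent would be something like curl -s http://www.imdb.com/chart/top | extract_urls > top250.txt, which produces the data file used later in this column. Using | as the sed delimiter in the last step just saves escaping the slashes in the URL.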

Scraping Data for Fun

Okay, so we've figured out how to get a list of the top 250 movies according to IMDb voters, but the question is, “how can we get useful information at this point?” The answer is by going to each and every page and scraping the content thereon.

Look at the page for The Godfather, and immediately a simple trivia question game comes to mind: in what year was a particular popular movie released?

This can be done by simply grabbing the title of the page, which just so happens to be the film name and year of release:


curl -s http://www.imdb.com/title/tt0068646/ | grep '<title>'

It's not quite what we want, but pretty darn close:


<title>The Godfather (1972)</title>

It's close enough that we now can write a short script that takes an IMDb movie title URL and outputs the movie name followed by a pipe symbol (a convenient field separator) and the year the film was released:


#!/bin/sh

# given an IMDb film URL, output title & release year

curl -s "$1" | \
  grep '<title>' | cut -d\> -f2 | cut -d\< -f1 | \
  sed 's/(\([0-9][0-9][0-9][0-9]\))/| \1/'

exit 0

(The complicated sed regular expression is to ensure that we don't merely match the open parenthesis, just in case the movie title includes parentheses.)
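The title-parsing step is easy to try without touching the network. In this sketch, the function name parse_title and the second sample title are invented for the test. The sed expression here captures the four digits in a group, so only the parentheses around the year are dropped and any others in the title are left alone:

```shell
#!/bin/sh
# parse_title: read HTML on stdin, print "Title | Year" for the
# <title> line. Only a four-digit group in parentheses is treated
# as the release year.
parse_title() {
  grep '<title>' | cut -d\> -f2 | cut -d\< -f1 |
    sed 's/(\([0-9][0-9][0-9][0-9]\))/| \1/'
}

printf '%s\n' '<title>The Godfather (1972)</title>' | parse_title

# Hypothetical title that contains parentheses of its own.
printf '%s\n' '<title>Some Film (Extended) (1999)</title>' | parse_title
```

The first call prints "The Godfather | 1972", and the second prints "Some Film (Extended) | 1999", leaving the parentheses in the title itself untouched.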

With that written, now we simply can pour the list into the script and pull a quick list of the top ten films:

for name in $(cat top250.txt)
do
  ./get-film-info.sh "$name"
done | head -10

And, here's the output:

The Godfather | 1972
The Shawshank Redemption | 1994
The Godfather: Part II | 1974
Buono, il brutto, il cattivo, Il | 1966
Pulp Fiction | 1994
Schindler's List | 1993
One Flew Over the Cuckoo's Nest | 1975
Star Wars: Episode V - The Empire Strikes Back | 1980
Casablanca | 1942
Shichinin no samurai | 1954
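One thing to keep in mind with that loop: head trims the output, not the input, so the loop itself may keep fetching pages long after the ten lines have been printed. A gentler variation is to trim the URL list first and read it a line at a time. The sketch below assumes nothing about IMDb at all; get_film_info is a stand-in I made up so it runs offline, and the URL list is fake:

```shell
#!/bin/sh
# Stand-in for ./get-film-info.sh so the sketch runs offline;
# it only reports the URL it was handed.
get_film_info() {
  printf 'would fetch %s\n' "$1"
}

# Build a small, hypothetical URL list in a temporary file.
list=$(mktemp)
printf 'http://www.imdb.com/title/tt%07d/\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$list"

# Trim the input to ten lines, then handle one URL per iteration.
first_ten() {
  head -n 10 "$1" | while IFS= read -r url
  do
    get_film_info "$url"
  done
}

first_ten "$list"
rm -f "$list"
```

Swap the stand-in for the real script, and only ten pages are ever requested.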

Cool. Now we're getting somewhere. Let's stop here, and next month, I'll look at pulling out a random entry from the 250 entries, then generate three random numbers numerically close to the correct year and present all four as possible answers to the question, “when was XX released?”
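As a teaser, here's one rough way the multiple-choice part might look. It isn't necessarily how next month's script will do it. The function name make_question, the five-year window and the seed argument are all assumptions of mine, and awk does the random number work:

```shell
#!/bin/sh
# make_question: given a "Title | Year" line and a seed, print the
# question followed by four candidate years in ascending order.
make_question() {
  printf '%s\n' "$1" | awk -F' [|] ' -v seed="$2" '
    BEGIN { srand(seed) }
    {
      title = $1; year = $2 + 0
      used[year] = 1; n = 0
      # Pick three distinct wrong years within 5 years of the real one.
      while (n < 3) {
        guess = year - 5 + int(rand() * 11)
        if (!(guess in used)) { used[guess] = 1; n++ }
      }
      printf "When was %s released?\n", title
      # Walk the window in order so the position gives nothing away.
      for (y = year - 5; y <= year + 5; y++)
        if (y in used) print y
    }'
}

make_question "Casablanca | 1942" 42
```

Which wrong years come out depends on the seed and on the awk in use, but the correct year is always among the four. A real game would want a seed that changes on every run and probably shouldn't offer years that are still in the future.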

For now, I think I'll pop Casablanca into my Blu-ray player and relax while the team at Linux Journal struggles with laying out the column. See ya later, shweetheart.

Dave Taylor is a 26-year veteran of UNIX, creator of The Elm Mail System, and most recently author of both the best-selling Wicked Cool Shell Scripts and Teach Yourself Unix in 24 Hours, among his 16 technical books. His main Web site is at www.intuitive.com, and he also offers up tech support at AskDaveTaylor.com. Follow him on Twitter if you'd like: twitter.com/DaveTaylor.
