Work the Shell - Of Movies, Trivia Games and Twitter

How to write a movie trivia game for Twitter.

During the past few months, I have become an addict. In fact, I went from being a skeptic to being an evangelist, in a way that probably makes me a poster child for a 12-step program. What is this evil thing that's sucked up my brain and passion? It's not illegal; it's not something I have to hide from my children; but, yes, it's pretty geeky, and it's one of the fastest-growing services in the Web 2.0 universe: Twitter.

What I find most compelling about Twitter is that it's both popular and nascent, and as a result, you can see its best practices evolve before your eyes. Even in the few months I've been active with the service, it has gone from just personal updates (as in “Eating burger at McD's. Back to meetings in :30”) to more business uses and news dissemination (“Flash: Redbox hacked by card sniffers. See...”).

In a nutshell, Twitter lets you send very short messages to dozens, hundreds or even thousands of followers, and from a Linux/shell scripting perspective, it's very cool because the API lets you send messages easily with a single line of code. But, let's get there in a bit. First, we need something to transmit.

Movie Trivia? Sure!

Because I can't seem to shake my enthusiasm for writing games as shell scripts (speaking of psychological curiosities, that's another one for you), I thought it would be interesting to write a movie trivia game for Twitter. So, that's what we'll do.

The challenge is figuring out where the data will come from. I mean, I built up a huge database of word-history trivia for etymologic.com, and my buddy Kevin Savetz and I wrote more than 500 computer trivia questions for trivial.net, and both took a huge amount of effort. Since creating those sites, I've become too lazy to repeat that effort, so the task is to identify a spot where I can leverage or repurpose existing movie information that will lend itself to a trivia game.

For this effort, I'll use the Internet Movie Database (www.imdb.com), which has an extraordinary amount of interesting movie trivia deep in its database. One place to start is its random movie quote feature, at www.imdb.com/Games/randomquote.html, but truth be told, that trivia is so darn obscure, I've never been able to identify any of the quotes, and I'm quite a movie fanatic.

Let's take on something more ambitious instead, and start with the IMDb top 250 movies list, isolating the quotes and trivia from those movies. That list is at www.imdb.com/chart/top, and if you crack it open, you'll see that each movie is referenced with a URL of this form: http://www.imdb.com/title/tt0068646/.

This means a simple grep can pull out the URL of each and every one of the top 250 movies. Utilizing curl, here's everything you need:


curl -s http://www.imdb.com/chart/top | \
sed 's/</\
/g' | grep '/title/tt' | more

The output isn't quite what we want, but with just this simple command (not even enough to justify a shell script), we're getting pretty close to a usable database:

a href="/title/tt0068646/">The Godfather
a href="/title/tt0111161/">The Shawshank Redemption
a href="/title/tt0071562/">The Godfather: Part II
a href="/title/tt0060196/">Buono, il brutto, il cattivo, Il
a href="/title/tt0110912/">Pulp Fiction

Because all we really want is a file of the 250 URLs for the top 250 movies, stripping out just what we need takes only a tiny addition:


curl -s http://www.imdb.com/chart/top  | sed 's/</\
/g' | grep '/title/tt' | cut -d\" -f2

And, here's the result:

/title/tt0068646/
/title/tt0111161/
/title/tt0071562/
/title/tt0060196/
/title/tt0110912/
...many, many lines skipped...
/title/tt0325980/
/title/tt0061809/
/title/tt0113247/

It's easy to drop this all into a data file, fixing the URLs as we go along so that they are fully qualified, with a simple additional call to sed like this:

| sed 's/^/http:\/\/www.imdb.com/'

Now we have a data file full of URLs, like this:

http://www.imdb.com/title/tt0068646/

Visit this URL, and you'll find that it's the top-rated movie on IMDb, the brilliant film The Godfather.
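Because the live chart page's markup can (and will) change over time, the extraction chain is easy to exercise offline on a canned snippet shaped like the markup shown above. Here's the whole pipeline, including the URL-qualifying sed, run against two hand-built lines (an assumption about the page layout, not the live page itself):

```shell
# Canned fragment shaped like the chart markup shown earlier
# (the real page layout may well differ by the time you read this).
html='<td><a href="/title/tt0068646/">The Godfather</a></td>
<td><a href="/title/tt0111161/">The Shawshank Redemption</a></td>'

printf '%s\n' "$html" | sed 's/</\
/g' | grep '/title/tt' | cut -d\" -f2 | sed 's/^/http:\/\/www.imdb.com/'
# http://www.imdb.com/title/tt0068646/
# http://www.imdb.com/title/tt0111161/
```

The sed with the embedded newline splits the HTML at every open angle bracket, so each anchor tag lands on its own line where grep and cut can get at it.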

Scraping Data for Fun

Okay, so we've figured out how to get a list of the top 250 movies according to IMDb voters, but the question is, “how can we get useful information at this point?” The answer is by going to each and every page and scraping the content thereon.

Look at the page for The Godfather, and immediately a simple trivia question game comes to mind: in what year was a particular popular movie released?

This can be done by simply grabbing the title of the page, which just so happens to be the film name and year of release:


curl -s http://www.imdb.com/title/tt0068646/ | grep '<title>'

It's not quite what we want, but pretty darn close:


<title>The Godfather (1972)</title>

It's close enough that we now can write a short script that takes an IMDb movie title URL and outputs the movie name followed by a pipe symbol (a convenient field separator) and the year the film was released:


#!/bin/sh

# given an IMDb film URL, output title & release year

curl -s "$1" | \
  grep '<title>' | cut -d\> -f2 | cut -d\< -f1 | \
  sed 's/ (\([0-9][0-9][0-9][0-9]\))$/ | \1/'

exit 0

(The sed regular expression matches the full parenthesized four-digit year rather than just an open parenthesis, so a movie title that itself contains parentheses won't trip us up.)
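The same cut/sed parsing can be sanity-checked offline by feeding canned title lines through the chain, with the year match anchored to the end of the line so a parenthesized title survives. The second title is a hypothetical example, not one of the top 250:

```shell
# Run a canned <title> line through the same parsing chain the script uses.
parse() {
  echo "$1" | cut -d\> -f2 | cut -d\< -f1 | \
    sed 's/ (\([0-9][0-9][0-9][0-9]\))$/ | \1/'
}

parse '<title>The Godfather (1972)</title>'
# The Godfather | 1972

# A title containing parentheses survives, because only the
# trailing four-digit year is rewritten:
parse '<title>(500) Days of Summer (2009)</title>'
# (500) Days of Summer | 2009
```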

With that written, we can simply pour the list into the script and pull a quick list of the top ten films:

for name in $(cat top250.txt)
do
  ./get-film-info.sh "$name"
done | head -10

And, here's the output:

The Godfather | 1972
The Shawshank Redemption | 1994
The Godfather: Part II | 1974
Buono, il brutto, il cattivo, Il | 1966
Pulp Fiction | 1994
Schindler's List | 1993
One Flew Over the Cuckoo's Nest | 1975
Star Wars: Episode V - The Empire Strikes Back | 1980
Casablanca | 1942
Shichinin no samurai | 1954

Cool. Now we're getting somewhere. Let's stop here, and next month, I'll look at pulling a random entry out of the 250, generating three random years numerically close to the correct one, and presenting all four as possible answers to the question, “when was XX released?”
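As a preview, here's one way those pieces might fit together; this is a sketch with hypothetical details (the filename top250.txt from earlier, a ±5-year window for the fakes), not the column's eventual code:

```shell
# Pick a random film URL from the data file built earlier, if present.
if [ -f top250.txt ]; then
  film=$(awk 'BEGIN { srand() } { line[NR] = $0 }
              END { print line[int(rand() * NR) + 1] }' top250.txt)
  echo "Randomly picked: $film"
fi

# Given the real release year, emit it plus three distinct fake years
# within five years of it, sorted so the right answer moves around.
year=1972
answers=$(awk -v y="$year" 'BEGIN {
  srand()
  seen[y] = 1; print y
  n = 0
  while (n < 3) {
    g = y + int(rand() * 11) - 5        # a guess within +/- 5 years
    if (!(g in seen)) { seen[g] = 1; print g; n++ }
  }
}' | sort -n)
echo "$answers"
```

Doing the random work in a single awk invocation sidesteps $RANDOM, which isn't available in every /bin/sh.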

For now, I think I'll pop Casablanca into my Blu-ray player and relax while the team at Linux Journal struggles with laying out the column. See ya later, shweetheart.

Dave Taylor is a 26-year veteran of UNIX, creator of The Elm Mail System, and most recently author of both the best-selling Wicked Cool Shell Scripts and Teach Yourself Unix in 24 Hours, among his 16 technical books. His main Web site is at www.intuitive.com, and he also offers up tech support at AskDaveTaylor.com. Follow him on Twitter if you'd like: twitter.com/DaveTaylor.

