Work the Shell - Of Movies, Trivia Games and Twitter
During the past few months, I have become an addict. In fact, I went from being a skeptic to being an evangelist, in a way that probably makes me a poster case for a 12-step program. What is this evil thing that's sucked up my brain and passion? It's not illegal; it's not something I have to hide from my children; but, yes, it's pretty geeky, and it's one of the fastest-growing services in the Web 2.0 universe: Twitter.
What I find most compelling about Twitter is that it's both popular and nascent, and as a result, you can see its best practices evolve before your eyes. Even in the few months I've been active with the service, it has gone from just personal updates (as in “Eating burger at McD's. Back to meetings in :30”) to more business uses and news dissemination (“Flash: Redbox hacked by card sniffers. See...”).
In a nutshell, Twitter lets you send very short messages to dozens, hundreds or even thousands of followers, and from a Linux/shell scripting perspective, it's very cool because the API lets you send messages easily with a single line of code. But, let's get there in a bit. First, we need something to transmit.
Because I can't seem to shake my enthusiasm for writing games as shell scripts (speaking of psychological curiosities, that's another one for you), I thought it would be interesting to write a movie trivia game for Twitter. So, that's what we'll do.
The challenge is to figure out where the data will come from. I mean, I built up a huge database of word history trivia for etymologic.com, and my buddy Kevin Savetz and I wrote more than 500 computer trivia questions for trivial.net, and it's a huge amount of effort. Since creating those sites, I've become too lazy to repeat the effort, so the question is to identify a spot where I can leverage or repurpose existing movie information that will lend itself to a trivia game.
For this effort, I'll use the Internet Movie Database (www.imdb.com), which has an extraordinary amount of interesting movie trivia deep in its database. One place to start is its random movie quote feature, at www.imdb.com/Games/randomquote.html, but truth be told, that trivia is so darn obscure, I've never been able to identify any of the quotes, and I'm quite a movie fanatic.
Let's make this more complicated instead, and start with the IMDb top 250 movies list and isolate the quotes and trivia from those movies. That list is at www.imdb.com/chart/top, and if you crack it open, you'll see that each movie is referenced with a URL of this form: http://www.imdb.com/title/tt0068646/.
This means a simple grep can pull out the URL of each and every one of the top 250 movies. Utilizing curl, here's everything you need:
curl -s http://www.imdb.com/chart/top | \
sed 's/</\
/g' | grep '/title/tt' | more
The output isn't quite what we want, but it's getting pretty close to a usable database with just this simple command, not even enough to justify a shell script:
a href="/title/tt0068646/">The Godfather
a href="/title/tt0111161/">The Shawshank Redemption
a href="/title/tt0071562/">The Godfather: Part II
a href="/title/tt0060196/">Buono, il brutto, il cattivo, Il
a href="/title/tt0110912/">Pulp Fiction
To strip out only what we need, because we really just want to have a file of 250 URLs of the top 250 movies, we merely need a tiny addition:
curl -s http://www.imdb.com/chart/top | sed 's/</\
/g' | grep '/title/tt' | cut -d\" -f2
And, here's the result:
/title/tt0068646/
/title/tt0111161/
/title/tt0071562/
/title/tt0060196/
/title/tt0110912/
...many, many lines skipped...
/title/tt0325980/
/title/tt0061809/
/title/tt0113247/
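If your grep supports the -o option (GNU and BSD grep both do, though it isn't in POSIX), the split-and-cut steps can be folded into a single match. What follows is only a sketch: the function name extract_title_paths and the sample HTML are made up for illustration, and the awk filter is there in case a page links the same film more than once.

```shell
#!/bin/sh
# Hypothetical helper (not from the column): read HTML on stdin and print
# each distinct /title/ttNNNNNNN/ path once, in first-seen order.
# Assumes a grep that supports -o (print only the matching text).
extract_title_paths() {
  grep -o '/title/tt[0-9]*/' | awk '!seen[$0]++'
}

# Demo on a small inline sample instead of the live chart page.
extract_title_paths <<'EOF'
<td><a href="/title/tt0068646/">The Godfather</a></td>
<td><a href="/title/tt0111161/">The Shawshank Redemption</a></td>
<td><a href="/title/tt0068646/">The Godfather</a></td>
EOF
```

Against the live site, the here-document would be replaced by curl -s http://www.imdb.com/chart/top piped into the function.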
It's easy to drop this all into a data file, fixing the URLs as we go along so that they are fully qualified, with a simple additional call to sed like this:
| sed 's/^/http:\/\/www.imdb.com/'
Now we have a data file full of URLs, like this:
http://www.imdb.com/title/tt0068646/
Visit this URL, and you'll find that it's the #1 top movie on IMDb, the brilliant film The Godfather.
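Spelled out, the URL-fixing step might look like the sketch below. The function name qualify_urls is hypothetical, and the two sample paths stand in for the output of the curl pipeline so the snippet runs without touching the network. Using | as the sed delimiter is just a way to skip escaping every slash.

```shell
#!/bin/sh
# Hypothetical helper: prefix each site-relative path on stdin with the
# IMDb host name. The | delimiter avoids the \/ escapes in the URL.
qualify_urls() {
  sed 's|^|http://www.imdb.com|'
}

# Demo with two sample paths in place of the live curl pipeline.
printf '%s\n' /title/tt0068646/ /title/tt0111161/ | qualify_urls
```

To build the real data file, feed the function from the curl pipeline instead of printf and add > top250.txt at the end.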
Okay, so we've figured out how to get a list of the top 250 movies according to IMDb voters, but the question is, “how can we get useful information at this point?” The answer is by going to each and every page and scraping the content thereon.
Look at the page for The Godfather, and immediately a simple trivia question game comes to mind: in what year was a particular popular movie released?
This can be done by simply grabbing the title of the page, which just so happens to be the film name and year of release:
curl -s http://www.imdb.com/title/tt0068646/ | grep '<title>'
It's not quite what we want, but pretty darn close:
<title>The Godfather (1972)</title>
It's close enough that we now can write a short script that takes an IMDb movie title URL and outputs the movie name followed by a pipe symbol (a convenient field separator) and the year the film was released:
#!/bin/sh
# given an IMDb film URL, output title & release year
# the final sed turns "(YYYY)" into "| YYYY" and leaves other parentheses alone
curl -s "$1" | \
  grep '<title>' | cut -d\> -f2 | cut -d\< -f1 | \
  sed 's/(\([0-9][0-9][0-9][0-9]\))/| \1/'
exit 0
(The complicated sed regular expression is to ensure that we don't merely match the open parenthesis, just in case the movie title includes parentheses.)
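One way to check that parenthesis handling without hitting the network is to move the parsing into a function that reads from stdin. This is a sketch, not the script from the column: parse_title is a hypothetical name, the second sample title is invented, and it assumes the title line still has the "Name (YYYY)" shape shown above. It also folds the grep and both cut calls into one sed.

```shell
#!/bin/sh
# Hypothetical helper: read an HTML page on stdin and print "Title | Year"
# from its <title> line. The greedy \(.*\) takes everything up to the last
# " (YYYY)", so parentheses inside the title itself survive.
parse_title() {
  sed -n 's/.*<title>\(.*\) (\([0-9]\{4\}\)).*<\/title>.*/\1 | \2/p'
}

# Sample <title> lines standing in for curl output; the second one is a
# made-up title that contains its own parentheses.
parse_title <<'EOF'
<title>The Godfather (1972)</title>
<title>Example Film (Director's Cut) (1999)</title>
EOF
```

With a live page, the call would be curl -s http://www.imdb.com/title/tt0068646/ | parse_title.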
With that written, now we simply can pour the list into the script and pull a quick list of the top ten films:
for name in $(cat top250.txt)
do
  ./get-film-info.sh $name
done | head -10
And, here's the output:
The Godfather | 1972
The Shawshank Redemption | 1994
The Godfather: Part II | 1974
Buono, il brutto, il cattivo, Il | 1966
Pulp Fiction | 1994
Schindler's List | 1993
One Flew Over the Cuckoo's Nest | 1975
Star Wars: Episode V - The Empire Strikes Back | 1980
Casablanca | 1942
Shichinin no samurai | 1954
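The pipe separator pays off as soon as you want to slice this list. As a small sketch (both function names are hypothetical, and the input is assumed to be in exactly the "Title | Year" shape shown above), here's how sort and awk can key on the year field:

```shell
#!/bin/sh
# Sort "Title | Year" records on stdin by release year, oldest first.
# -t'|' sets the field delimiter; -k2,2n sorts numerically on field 2 only.
sort_by_year() {
  sort -t'|' -k2,2n
}

# Print only the titles released in the year given as $1.
titles_from_year() {
  awk -F' [|] ' -v want="$1" '$2 == want { print $1 }'
}

# Demo on three of the records from the output above.
sort_by_year <<'EOF'
The Godfather | 1972
Casablanca | 1942
Pulp Fiction | 1994
EOF
```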
Cool. Now we're getting somewhere. Let's stop here, and next month, I'll look at pulling out a random entry from the 250 entries, then generate three random numbers numerically close to the correct year and present all four as possible answers to the question, “when was XX released?”
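For the impatient, here's one rough way that next step could go. It's a guess at the approach, not the script from next month's column: the function name make_question, the plus-or-minus five year window and the optional seed argument are all assumptions of mine, and the random numbers come from awk because plain sh has no random number variable of its own.

```shell
#!/bin/sh
# Hypothetical sketch: read "Title | Year" lines on stdin, pick one at
# random, and print a question followed by four candidate years (the real
# one plus three distinct decoys within 5 years of it), in ascending order.
# $1 is an optional seed so a run can be repeated.
make_question() {
  awk -F' [|] ' -v seed="${1:-}" '
    BEGIN { if (seed != "") srand(seed); else srand() }
    { title[NR] = $1; year[NR] = $2 }
    END {
      if (NR == 0) exit 1
      pick = int(rand() * NR) + 1
      correct = year[pick] + 0
      used[correct] = 1
      count = 0
      while (count < 3) {
        guess = correct + int(rand() * 11) - 5   # offset in -5..+5
        if (!(guess in used)) { used[guess] = 1; count++ }
      }
      printf "When was %s released?\n", title[pick]
      for (y = correct - 5; y <= correct + 5; y++)
        if (y in used) print y
    }'
}

# Demo: ask about one of three sample records.
make_question <<'EOF'
The Godfather | 1972
Casablanca | 1942
Pulp Fiction | 1994
EOF
```

Printing the years in ascending order keeps the position of the right answer from giving anything away.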
For now, I think I'll pop Casablanca into my Blu-ray player and relax while the team at Linux Journal struggles with laying out the column. See ya later, shweetheart.
Dave Taylor is a 26-year veteran of UNIX, creator of The Elm Mail System, and most recently author of both the best-selling Wicked Cool Shell Scripts and Teach Yourself Unix in 24 Hours, among his 16 technical books. His main Web site is at www.intuitive.com, and he also offers up tech support at AskDaveTaylor.com. Follow him on Twitter if you'd like: twitter.com/DaveTaylor.
Comments
Nice script
I think one of the things I like about your writing, Dave, is that you don't give all the answers...it just wouldn't be fun to merely copy code, but to get the layering of thinking that goes into scripting.
Peace!