Work the Shell - Of Movies, Trivia Games and Twitter
During the past few months, I have become an addict. In fact, I went from being a skeptic to being an evangelist, in a way that probably makes me a poster case for a 12-step program. What is this evil thing that's sucked up my brain and passion? It's not illegal; it's not something I have to hide from my children; but, yes, it's pretty geeky, and it's one of the fastest-growing services in the Web 2.0 universe: Twitter.
What I find most compelling about Twitter is that it's both popular and nascent, and as a result, you can see its best practices evolve before your eyes. Even in the few months I've been active with the service, it has gone from just personal updates (as in “Eating burger at McD's. Back to meetings in :30”) to more business uses and news dissemination (“Flash: Redbox hacked by card sniffers. See...”).
In a nutshell, Twitter lets you send very short messages to dozens, hundreds or even thousands of followers, and from a Linux/shell scripting perspective, it's very cool because the API lets you send messages easily with a single line of code. But, let's get there in a bit. First, we need something to transmit.
Because I can't seem to shake my enthusiasm for writing games as shell scripts (speaking of psychological curiosities, that's another one for you), I thought it would be interesting to write a movie trivia game for Twitter. So, that's what we'll do.
The challenge is to figure out where the data will come from. I mean, I built up a huge database of word history trivia for etymologic.com, and my buddy Kevin Savetz and I wrote more than 500 computer trivia questions for trivial.net, and it's a huge amount of effort. Since creating those sites, I've become too lazy to repeat the effort, so the question is to identify a spot where I can leverage or repurpose existing movie information that will lend itself to a trivia game.
For this effort, I'll use the Internet Movie Database (www.imdb.com), which has an extraordinary amount of interesting movie trivia deep in its database. One place to start is its random movie quote feature, at www.imdb.com/Games/randomquote.html, but truth be told, that trivia is so darn obscure, I've never been able to identify any of the quotes, and I'm quite a movie fanatic.
Let's make this more complicated instead, and start with the IMDb top 250 movies list and isolate the quotes and trivia from those movies. That list is at www.imdb.com/chart/top, and if you crack it open, you'll see that each movie is referenced with a URL of this form: http://www.imdb.com/title/tt0068646/.
This means a simple grep can pull out the URL of each and every one of the top 250 movies. Utilizing curl, here's everything you need:
curl -s http://www.imdb.com/chart/top | \
sed 's/</\
/g' | grep '/title/tt' | more
The output isn't quite what we want, but it's getting pretty close to a usable database with just this simple command, not even enough to justify a shell script:
a href="/title/tt0068646/">The Godfather
a href="/title/tt0111161/">The Shawshank Redemption
a href="/title/tt0071562/">The Godfather: Part II
a href="/title/tt0060196/">Buono, il brutto, il cattivo, Il
a href="/title/tt0110912/">Pulp Fiction
To strip out only what we need, because we really just want to have a file of 250 URLs of the top 250 movies, we merely need a tiny addition:
curl -s http://www.imdb.com/chart/top | sed 's/</\
/g' | grep '/title/tt' | cut -d\" -f2
And, here's the result:
/title/tt0068646/
/title/tt0111161/
/title/tt0071562/
/title/tt0060196/
/title/tt0110912/
...many, many lines skipped...
/title/tt0325980/
/title/tt0061809/
/title/tt0113247/
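If the sed step with its embedded newline looks odd, here's a rough sketch of the same extraction that can be tried without touching the network. It swaps in tr to do the splitting, wraps the pipeline in a function I've called extract_title_paths, and runs it against a made-up HTML fragment; the function name and the sample markup are my own assumptions, not anything IMDb serves.

```shell
# extract_title_paths: read HTML on stdin, print one /title/tt.../ path per line.
extract_title_paths() {
  tr '<' '\n' |          # put every tag on its own line, like the sed step
    grep '/title/tt' |   # keep only the lines that link to a film page
    cut -d'"' -f2        # the path sits between the first pair of double quotes
}

# Made-up sample input standing in for the real chart page
sample='<td><a href="/title/tt0068646/">The Godfather</a></td><td><a href="/title/tt0111161/">The Shawshank Redemption</a></td>'

printf '%s\n' "$sample" | extract_title_paths
```

Fed the sample line, this prints the two /title/ paths, one per line. To run it for real, pipe the output of curl -s into the function instead of the sample.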
It's easy to drop this all into a data file, fixing the URLs as we go along so that they are fully qualified, with a simple additional call to sed like this:
| sed 's/^/http:\/\/www.imdb.com/'
Now we have a data file full of URLs, like this:
http://www.imdb.com/title/tt0068646/
Visit this URL, and you'll find that it's the #1 top movie on IMDb, the brilliant film The Godfather.
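The URL-fixing step can be sketched on its own too. This version uses | as the sed delimiter so the slashes in the URL don't need escaping; the function name qualify_urls is just something I picked for illustration.

```shell
# qualify_urls: read site-relative paths on stdin, print fully qualified URLs.
qualify_urls() {
  # With | as the delimiter, the slashes in the URL need no backslashes
  sed 's|^|http://www.imdb.com|'
}

printf '%s\n' /title/tt0068646/ /title/tt0111161/ | qualify_urls
```

Redirect the output of the full pipeline to a file such as top250.txt, and the list is ready for the next step.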
Okay, so we've figured out how to get a list of the top 250 movies according to IMDb voters, but the question is, “how can we get useful information at this point?” The answer is by going to each and every page and scraping the content thereon.
Look at the page for The Godfather, and immediately a simple trivia question game comes to mind: in what year was a particular popular movie released?
This can be done by simply grabbing the title of the page, which just so happens to be the film name and year of release:
curl -s http://www.imdb.com/title/tt0068646/ | grep '<title>'
It's not quite what we want, but pretty darn close:
<title>The Godfather (1972)</title>
It's close enough that we now can write a short script that takes an IMDb movie title URL and outputs the movie name followed by a pipe symbol (a convenient field separator) and the year the film was released:
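#!/bin/sh
# given an IMDb film URL, output title & release year

# Pull the <title> line, keep only the text between the tags, then
# turn the trailing "(YYYY)" into "| YYYY". Capturing the four digits
# leaves any other parentheses in the film title untouched.
curl -s "$1" | \
  grep '<title>' | cut -d\> -f2 | cut -d\< -f1 | \
  sed 's/ *(\([0-9][0-9][0-9][0-9]\))/ | \1/'

exit 0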
(The complicated sed regular expression is to ensure that we don't merely match the open parenthesis, just in case the movie title includes parentheses.)
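It's worth checking the title parsing against a title that has parentheses of its own, because any sed that simply deletes the first open and close parenthesis it finds will hit the ones in the title before it reaches the year. Here's a minimal sketch that captures the four digits instead and can be fed sample lines offline. The function name parse_title and the second sample title are invented for the test.

```shell
# parse_title: read HTML on stdin, print "Title | Year" for the <title> line.
parse_title() {
  grep '<title>' | cut -d'>' -f2 | cut -d'<' -f1 |
    # match "(YYYY)" plus any spaces before it, keep only the digits
    sed 's/ *(\([0-9][0-9][0-9][0-9]\))/ | \1/'
}

printf '%s\n' '<title>The Godfather (1972)</title>' | parse_title
# prints: The Godfather | 1972

# Hypothetical title with its own parentheses
printf '%s\n' '<title>Some Film (Extended) (1999)</title>' | parse_title
# prints: Some Film (Extended) | 1999
```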
With that written, now we simply can pour the list into the script and pull a quick list of the top ten films:
for name in $(cat top250.txt)
do
  ./get-film-info.sh "$name"
done | head -10
And, here's the output:
The Godfather | 1972
The Shawshank Redemption | 1994
The Godfather: Part II | 1974
Buono, il brutto, il cattivo, Il | 1966
Pulp Fiction | 1994
Schindler's List | 1993
One Flew Over the Cuckoo's Nest | 1975
Star Wars: Episode V - The Empire Strikes Back | 1980
Casablanca | 1942
Shichinin no samurai | 1954
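One thing to keep in mind with a loop like this: piping the whole loop into head trims the output, but the loop itself may keep requesting pages after head has what it wants. A gentler variant, sketched here, trims the URL list first and then reads it line by line. The lookup function is a hypothetical stand-in for ./get-film-info.sh so the sketch runs offline, and process_first is a name I made up.

```shell
# lookup is a hypothetical stand-in for ./get-film-info.sh so this runs offline
lookup() {
  printf 'would fetch %s\n' "$1"
}

# process_first: $1 = file of URLs (one per line), $2 = how many to handle
process_first() {
  head -n "$2" "$1" | while IFS= read -r url; do
    lookup "$url"
  done
}

# Demo with a throwaway three-line list; only the first two are processed
list=$(mktemp)
printf '%s\n' http://www.imdb.com/title/tt0068646/ \
              http://www.imdb.com/title/tt0111161/ \
              http://www.imdb.com/title/tt0071562/ > "$list"
process_first "$list" 2
rm -f "$list"
```

Swap the body of lookup for a call to the real script, and process_first top250.txt 10 touches only ten pages.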
Cool. Now we're getting somewhere. Let's stop here, and next month, I'll look at pulling out a random entry from the 250 entries, then generate three random numbers numerically close to the correct year and present all four as possible answers to the question, “when was XX released?”
For now, I think I'll pop Casablanca into my Blu-ray player and relax while the team at Linux Journal struggles with laying out the column. See ya later, shweetheart.
Dave Taylor is a 26-year veteran of UNIX, creator of The Elm Mail System, and most recently author of both the best-selling Wicked Cool Shell Scripts and Teach Yourself Unix in 24 Hours, among his 16 technical books. His main Web site is at www.intuitive.com, and he also offers up tech support at AskDaveTaylor.com. Follow him on Twitter if you'd like: twitter.com/DaveTaylor.
Comments
Nice script
I think one of the things I like about your writing, Dave, is that you don't give all the answers...it just wouldn't be fun to merely copy code, but to get the layering of thinking that goes into scripting.
Peace!