Work the Shell - Counting Words and Letters
I know I have been writing about the basics of working with variables in shell scripts, but I'm going to diverge and address a recent query I received. Okay? (And, hey, write to me.)
“Dear Dave, I seek an edge when I next play Hangman or other word games. I want to know what words are most common in the English language and what letters are most common in written material too. If you can show how to do that as a shell script, it'd be useful for your column, but if not, can you point me to an on-line resource? Thanks, Mike R.”
Okay, I can tell you up front, Mike, that the secret to playing Hangman is to ensure that you have sufficient guesses to get at least 30% of the letters before you're in great peril. Oh, that's not what you seek, is it? The first letter to guess, always, is E, which is the most common letter in the English language. If you have a Scrabble set, you also can figure out the frequency of letters, because the points for individual letters are inversely proportional to their frequency. That is, E is worth one point, while the Q and Z—two very uncommon letters in English—are worth ten points each.
But, let's work on a shell script to verify and prove all this, shall we?
The first step is to find some written material to analyze. That's easily done by going to one of my favorite places on the Web, Project Gutenberg. You can pop there too at www.gutenberg.org.
With thousands and thousands of books available in free, downloadable form, let's grab only three: Dracula by Bram Stoker, History of the United States by Charles A. Beard and Mary Ritter Beard, and Pride and Prejudice by Jane Austen. They're all obviously a bit older, but that's okay for our purposes. To make life easy, I'll download them as plain text and leave the geeky Project Gutenberg introduction at the top of each file too, just for more word variation and, well, because I'm lazy. Okay with you, dear reader?
Here's a quick heads up on the three:
$ wc *txt
   16624  163798  874627 dracula.txt
   24398  209289 1332539 history-united-states.txt
   13426  124576  717558 pride-prejudice.txt
   54448  497663 2924724 total
Okay, so we have 54,448 lines of text, representing 497,663 words and 2,924,724 characters. That's a lot of text.
The key to figuring out any of our desired statistics is to realize that the basic strategy we need to use is to break the content down into smaller pieces, sort them, and then use the great uniq -c capability, which de-dupes the input stream, counting frequency as it goes. As a shell pipe, we're talking about sort | uniq -c, coupled with whatever command we need to break down the individual entities.
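To see the sort | uniq -c idea in isolation, here is a minimal sketch on a made-up list of words, one per line. The helper name count_items and the sample data are assumptions for illustration only; they are not part of the three books.

```shell
# count_items: read one item per line on stdin and print each distinct
# item prefixed by how many times it occurred.
# uniq only collapses adjacent duplicates, which is why sort comes first.
# (count_items is a hypothetical helper name.)
count_items() {
  sort | uniq -c
}

# Made-up sample input: six items, three distinct.
printf '%s\n' pear apple pear fig apple pear | count_items
```

The output lists apple with a count of 2, fig with 1 and pear with 3. The exact column padding that uniq -c puts in front of each count varies between systems.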
For this task, I'm going to use tr, like this, to convert spaces to newlines:
$ cat *txt | tr ' ' '\n' | head
The
Project
Gutenberg
EBook
of
Dracula,
by
Bram
Stoker
Okay, so what happens when we actually unleash the beast on all 54,448 lines of our combined text?
$ cat *txt | tr ' ' '\n' | wc -l
526104
That's strange. Somehow I would expect that breaking down every line by space delimiter should be fairly close to the word count of wc, but most likely the documents have double spaces after sentences, like “the end.  The next”, and blank lines between paragraphs, each of which adds an empty line that wc doesn't count as a word. No worries, though, it'll all vanish once we take the next step.
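The mismatch is easy to reproduce on a tiny scale. The sketch below uses a made-up three-line sample (not text from the books) with a double space after a period and a blank line between paragraphs; the sample function name is hypothetical.

```shell
# sample: emit a small made-up text with a double space after "end."
# and an empty line between the two paragraphs.
sample() {
  printf 'the end.  The next\n\nchapter begins\n'
}

sample | wc -w                        # words as wc counts them
sample | tr ' ' '\n' | wc -l          # lines after splitting on spaces
sample | tr ' ' '\n' | grep -c '^$'   # how many of those lines are empty
```

Here wc -w reports 6 words, while the split produces 8 lines, 2 of which are empty. The extra lines are exactly the double space and the blank line.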
Now that we have the ability to break down our documents into individual words, let's sort and “uniq” it to see what we see:
$ cat *txt | tr ' ' '\n' | sort | uniq | wc -l
52407
But, that's not right. Do you know why?
If you said, “Dude! You need to account for capitalization!”, you'd be on the right track. In fact, we need to transliterate everything to lowercase. We also need to strip out all the punctuation, because right now it's counting “cat,” and “cat” as two different words. Not good.
First off, transliteration is best done with a character class rather than with a letter range. In tr, it's a bit funky with the [::] notation:
$ echo "Hello" | tr '[:upper:]' '[:lower:]'
hello
Stripping out punctuation is a wee bit trickier, but not much. Again, we can use a character class in tr:
$ echo "this, and? that! for sure." | tr -d '[:punct:]'
this and that for sure
Coolness, eh? I bet you didn't know you could do that! Now, let's put it all together:
$ cat *txt | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | sort | uniq | wc -l
28855
So, that chops it down from 52,407 to 28,855, which makes sense to me. One more transform is needed though. Let's strip out all lines that contain anything other than lowercase letters, which eliminates digits. That can be done with a simple grep -v '[^a-z]':
$ cat *txt | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | grep -v '[^a-z]' | sort | uniq | wc -l
19820
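To make it clearer what that grep actually keeps, here is a small sketch on made-up input. The only_letters name is a hypothetical wrapper; the grep pattern itself is the one used above.

```shell
# only_letters: drop every line that contains a character outside a-z.
# An empty line contains no such character, so it passes through too.
# (only_letters is a hypothetical helper name.)
only_letters() {
  grep -v '[^a-z]'
}

# Made-up input: three plain words, a number, a mixed token, an empty line.
printf '%s\n' cat 1897 dracula 3rd '' the | only_letters
```

This prints cat, dracula, an empty line and the. The number and the mixed token 3rd are gone, but the empty line survives, which is worth keeping in mind when reading the counts.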
If you analyze only Dracula, by the way, it turns out that the entire book has only 9,434 unique words. Useful, eh?
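If you want to run the same count on one book at a time, the pipeline can be wrapped in a function. This is only a sketch: the name unique_word_count is an assumption, and the stages are the same ones used above, with the newline written as '\n'.

```shell
# unique_word_count: print the number of distinct "words" in the files
# named as arguments, or in stdin when no file is given.
# (unique_word_count is a hypothetical helper name.)
unique_word_count() {
  cat "$@" | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' |
    grep -v '[^a-z]' | sort | uniq | wc -l
}

# Small made-up demo on stdin; with the real files you would instead run
# something like: unique_word_count dracula.txt
printf 'The cat, the Cat.\nA dog 42\n' | unique_word_count
```

The demo input reduces to the four distinct words a, cat, dog and the, so the function reports 4. As with the pipeline it wraps, an empty line in the input would be counted as one more "word".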
Now, finally, let's tweak things just a bit and see the ten most common words in this corpus:
$ cat *txt | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | grep -v '[^a-z]' | sort | uniq -c | sort -rn | head
29247 the
19925
16995 of
14715 and
13010 to
 9293 in
 7894 a
 6474 i
 5724 was
 5206 that
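The 19925 entry with no word next to it is the count of empty lines, which slip past grep -v '[^a-z]'. One possible variation, sketched here under my own assumptions, adds a second grep to drop them and takes the number of results as an argument. The name top_words is hypothetical, and the extra grep is an addition to the pipeline above.

```shell
# top_words: print the N most frequent words read from stdin (default 10).
# Same stages as the pipeline above, plus grep -v '^$' to discard empty
# lines so they do not show up as a blank "word".
# (top_words is a hypothetical helper name.)
top_words() {
  tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' |
    grep -v '[^a-z]' | grep -v '^$' |
    sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Small made-up demo; with the real files: cat *txt | top_words
printf 'The cat saw the other cat.\n\nThe end.\n' | top_words 2
```

On the demo input this lists the with a count of 3 and cat with a count of 2. Words that tie on count may come out in a different order on different systems, since sort -rn only compares the numbers.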
And, now you know.
Next month, I'll wrap this up by showing how you can analyze individual letter occurrences too, and finally, I'll offer a way to find some great Hangman words for stumping your friends.
Dave Taylor is a 26-year veteran of UNIX, creator of The Elm Mail System, and most recently author of both the best-selling Wicked Cool Shell Scripts and Teach Yourself Unix in 24 Hours, among his 16 technical books. His main Web site is at www.intuitive.com, and he also offers up tech support at AskDaveTaylor.com. You also can follow Dave on Twitter through twitter.com/DaveTaylor.
Dave Taylor has been hacking shell scripts for over thirty years. Really. He's the author of the popular "Wicked Cool Shell Scripts" and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.