Work the Shell - More Fun with Word and Letter Counts
If you can remember back a month, you'll recall that I'd received a blessed e-mail from someone (hint, hint) asking:
Dear Dave, I seek an edge when I next play Hangman or other word games. I want to know what words are most common in the English language and what letters are most common in written material too. If you can show how to do that as a shell script, it'd be useful for your column, but if not, can you point me to an on-line resource? Thanks.—Mike R.
I grabbed three books from the Project Gutenberg archive (gutenberg.org) to analyze and use as test input: Dracula by Bram Stoker, History of the United States by Charles A. Beard and Mary Ritter Beard, and Pride and Prejudice by Jane Austen.
The obvious way to analyze these text files is with the wc command, which reveals that, combined, we're looking at 497,663 words and 2.9 million characters.
We used the following to identify the most common words:
$ cat *txt | tr ' ' '\012' | \
tr '[:upper:]' '[:lower:]' | \
tr -d '[:punct:]' | grep -v '[^a-z]' | \
sort | uniq -c | sort -rn | head
The results were sufficient to reveal that the top ten words that appear in our 500,000-word sample are, in order: the, of, and, to, in, a, i, was, that and it.
Now, let's go in a different direction and analyze letter frequency. Then, we'll go back to finding interesting and unusual words.
The question underlying calculating letter frequency is this: “how do you break down a word into individual letters so that you have one letter per line?” It turns out that the handy Linux tool fold can do exactly what we want:
$ echo hello | fold -w1
h
e
l
l
o
Neatly done! (Note that you can't use fmt or similar commands because even if you specify -w1 for width, it works with words, not characters.)
It's an easy leap from there to make fold break down every single word in a text file, sort the results, and use our power duo of uniq -c | sort -rn to get the results we seek:
$ fold -w1 < dracula.txt | sort | \
uniq -c | sort -rn | head
157559
78409 e
56524 t
51608 a
50568 o
43453 n
41749 h
38150 s
37950 i
35001 r
A blank is the most common, but we can skip that visually rather than complicate our pipe with yet another process.
As I said in the beginning, E is the most common letter, but it's a surprise to see T as the second most common, frankly. Maybe it's because we're not compensating for upper-/lowercase? Let's try again:
$ fold -w1 < dracula.txt | sort | \
tr '[:lower:]' '[:upper:]' | uniq -c | \
sort -rn | head -5
157559
78409 E
56524 T
51608 A
50568 O
Wait a minute. We shouldn't get the same result! Hmmm...can you see what I've done wrong? Hint: look at the order of commands in the pipe.
Got it? The tr needs to appear before the first sort command. Otherwise, it converts the letters after they already have been sorted, so the uppercase and lowercase versions of a letter can end up in separate runs that uniq -c counts separately. We also should strip out punctuation, which can be done with the tr command as well. Here's a better attempt:
$ fold -w1 < dracula.txt | \
tr '[:lower:]' '[:upper:]' | sort | \
tr -d '[:punct:]' | uniq -c | \
sort -rn | head
157559
79011 E
58618 T
53146 A
51122 O
43975 N
43501 H
43423 I
39296 S
35607 R
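To see why the position of tr matters, here is a minimal sketch on a four-line input. The function names count_sorted_first and count_folded_first are made up for this illustration, LC_ALL=C is an assumption added so the sort order is predictable on any system, and the trailing awk only normalizes the spacing of the uniq -c output:

```shell
# Hypothetical helpers, not from the column itself.

# Case is folded AFTER sorting: "A" and "a" were sorted apart, so once
# both read "A" they sit in separate runs and uniq -c counts them apart.
count_sorted_first() {
  LC_ALL=C sort | tr '[:lower:]' '[:upper:]' | uniq -c | awk '{ print $1, $2 }'
}

# Case is folded BEFORE sorting: all the "A" lines end up adjacent,
# so uniq -c reports one combined count per letter.
count_folded_first() {
  tr '[:lower:]' '[:upper:]' | LC_ALL=C sort | uniq -c | awk '{ print $1, $2 }'
}
```

Feeding both functions the same input, for example printf 'b\nA\na\nB\n' | count_sorted_first, shows the split counts from the first ordering and the merged counts from the second.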
Will this ordering change if we use all three of our books rather than just Dracula? Let's try it:
$ cat *.txt | fold -w1 | \
tr '[:lower:]' '[:upper:]' | sort | \
tr -d '[:punct:]' | uniq -c | \
sort -rn | head
468727
273409 E
201726 T
175637 A
169836 O
158561 N
155910 I
135513 S
133927 R
127716 H
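Nearly the same result! E, T, A, O and N keep the top five spots, although H slips from sixth place in Dracula to ninth across all three books. In order of frequency, the letters appear in text in the following sequence: E T A O N I S R H D L C U M F W G P Y B V K X J Q Z. (I'm a bit surprised that J shows up so infrequently.)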
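If you'd rather have the pipe print that one-line sequence for you, here's a rough sketch. The function name letter_order is my own invention, and it differs from the pipes above in one respect: it throws away everything that isn't a letter A-Z up front, so the blank and punctuation counts never appear:

```shell
# letter_order (hypothetical name): read text on stdin and print the
# letters A-Z it contains, most frequent first, on a single line.
letter_order() {
  tr '[:lower:]' '[:upper:]' |   # fold case before anything is sorted
    tr -cd 'A-Z' |               # keep letters only (drops blanks, digits, punctuation, newlines)
    fold -w1 |                   # one letter per line
    sort | uniq -c | sort -rn |  # count and rank
    awk '{ printf "%s%s", (NR > 1 ? " " : ""), $2 } END { print "" }'
}
```

Run it as cat *.txt | letter_order. Letters with identical counts come out in whatever order sort leaves them, so ties near the rare end of the list may shuffle between systems.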
You now know what order to guess letters in Hangman, if nothing else.
Before we wrap this up, let's go back through the words in our corpus and find just those that are at least ten letters long and occur infrequently. Here's how I'll do that:
$ cat *.txt | tr ' ' '\012' | \
tr '[:upper:]' '[:lower:]' | \
tr -d '[:punct:]' | tr -d '[0-9]' | \
sort | uniq -c | sort -n | \
grep -E '..................' | head
1 abolitionists
1 accommodation
1 accommodations
1 accomplishing
1 accomplishments
1 accountability
1 achievements
1 acknowledging
1 acknowledgments
1 acquaintanceship
1 administrative
1 advertisement
That gives us long words that occur infrequently in the English language, or at least only once in the 500,000-word corpus we've been analyzing.
(True confession: I simply added more and more dots to the grep regular expression until I weeded out almost all of the results. Keep in mind that the dots also match the padding and count that uniq -c puts in front of each word, so a tidier pattern would be something like [a-z]{10,}, which matches only runs of ten or more letters.)
Some of these words obviously are more common in everyday parlance than in these particular books, however, such as advertisement, which I'm sure occurs more than once every 500,000 words in normal conversation, or at least in the circles I frequent!
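An alternative to counting dots is to filter on word length before the words are counted, so the pattern never sees the numbers that uniq -c adds. This is only a sketch: the function name rare_long_words is hypothetical, and it splits on any whitespace rather than just spaces, so its output can differ slightly from the listing above:

```shell
# rare_long_words (hypothetical name): read text on stdin and list words
# of ten or more letters, least frequent first, as "count word".
rare_long_words() {
  tr -s '[:space:]' '\n' |              # one word per line
    tr '[:upper:]' '[:lower:]' |        # fold case
    tr -d '[:punct:][:digit:]' |        # strip punctuation and digits
    grep -E '^[a-z]{10,}$' |            # keep only words of 10+ letters
    sort | uniq -c | sort -n
}
```

Use it as cat *.txt | rare_long_words | head to get the first few results.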
What would really be great for Hangman would be to apply the letter-frequency rule further, so that you extract the infrequently occurring words, then come up with a sum value for the frequency of each letter in the word (I'd assign E = 1, T = 2, A = 3, O = 4, for example) and identify the longest words with the highest scores. Those will be your very best Hangman words.
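Here's one way that scoring idea might look, as a sketch only. The function name hangman_score is made up, and the ranking string is the letter sequence this column arrived at, with E = 1 through Z = 26, so a higher sum means rarer letters:

```shell
# hangman_score (hypothetical name): read one word per line on stdin and
# print "score length word", highest score first.
hangman_score() {
  awk -v order="ETAONISRHDLCUMFWGPYBVKXJQZ" '
    NF {
      word = toupper($1); score = 0
      for (i = 1; i <= length(word); i++)
        score += index(order, substr(word, i, 1))  # E=1 ... Z=26, anything else adds 0
      print score, length(word), $1
    }' | sort -k1,1nr -k2,2nr
}
```

Because the score is a plain sum, long words get a head start, which fits the goal of finding the longest words with the highest scores. A word list such as the output of the rare-word pipe, trimmed to just the words, would be the natural input.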
But, I'm out of space and last I checked, I was supposed to be writing about different variable reference formats in shell scripts anyway. I swear, next column, I'll get back to that. Unless you (hint, hint) write me a note with a puzzle or scripting challenge to solve.
Dave Taylor has been involved with UNIX since he first logged in to the ARPAnet in 1980. That means, yes, he's coming up to the 30-year mark now. You can find him just about everywhere on-line, but start here: www.DaveTaylorOnline.com.
Dave Taylor has been hacking shell scripts for over thirty years. Really. He's the author of the popular "Wicked Cool Shell Scripts" and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.
Comments
Letter frequency
Samuel Morse would have loved your article, as I did. In devising Morse code, he assigned the briefest codes to the most frequently used letters. Lacking your routine and access to /usr/share/dict/words, he based his count on the number of letters in sets of printers' type, according to AskOxford.com (http://www.askoxford.com/asktheexperts/faq/aboutwords/frequency). The figures he came up with for the most common letters were:
12,000 E
9,000 T
8,000 A, I, N, O, S
In contrast, Zs occupied the least letter space in typesetters' cabinets, with only 200 on hand.
As a result, E and T in Morse code require single key presses, one short and the other long (dot and dash, respectively). Other letters may require four key presses, combining dots and/or dashes. Optimization makes a difference. Even so, sending 5 words per minute in Morse code is a challenge for the novice, and 20 words per minute is the mark of a pro!
AskOxford.com poses some interesting questions that shell scripters could have fun with. For example,
Are there any English words containing the same letter three times in a row?
Are there any words in the English language that use all five vowels with no intervening consonants or have the five vowels in the right order?
Too easy? Try: What is the longest one-syllable English word?
I remember, before you could look up virtually everything on the Internet, being surprised at how difficult it was to code a routine for breaking words into syllables. The coding itself wasn't hard, but detailing the underlying rules was such a challenge.
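Two of the AskOxford puzzles above lend themselves to grep. What follows is a sketch under my own assumptions: the function names are invented, the input is one word per line (for instance /usr/share/dict/words), and "vowels in the right order" is read narrowly as each of a, e, i, o, u appearing exactly once, in that order:

```shell
# triple_letters (hypothetical name): words with the same letter three
# times in a row. Uses a basic-regex backreference.
triple_letters() {
  tr '[:upper:]' '[:lower:]' | grep '\([a-z]\)\1\1'
}

# vowels_in_order (hypothetical name): words containing a, e, i, o, u
# exactly once each, in alphabetical order.
vowels_in_order() {
  tr '[:upper:]' '[:lower:]' |
    grep '^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$'
}
```

For example, triple_letters < /usr/share/dict/words lists the candidates in whatever dictionary file your system has. Both functions print matches in lowercase.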