# Work the Shell - More Fun with Word and Letter Counts

## HOWTOs

by Dave Taylor

If you can remember back a month, you'll recall that I'd received a blessed e-mail from someone (hint, hint) asking:

> Dear Dave, I seek an edge when I next play Hangman or other word games. I want to know what words are most common in the English language and what letters are most common in written material too. If you can show how to do that as a shell script, it'd be useful for your column, but if not, can you point me to an on-line resource? Thanks.
>
> Mike R.

I grabbed three books from the Project Gutenberg archive (gutenberg.org) to analyze and use as test input: Dracula by Bram Stoker, History of the United States by Charles A. Beard and Mary Ritter Beard, and Pride and Prejudice by Jane Austen.

The obvious way to analyze these text files is with the wc command, which reveals that, combined, we're looking at 497,663 words, 2.9 million characters.

We used the following to identify the most common words:

```
$ cat *txt | tr ' ' '\012' | \
tr '[:upper:]' '[:lower:]' | \
tr -d '[:punct:]' | grep -v '[^a-z]' | \
sort | uniq -c | sort -rn | head
```

The results were sufficient to reveal that the top ten words that appear in our 500,000-word sample are, in order: the, of, and, to, in, a, i, was, that and it.
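If you'd rather not retype that pipeline each time, here's one way it might be wrapped in a shell function. Treat it as a sketch: the name `top_words` and its optional count argument are my own inventions for illustration, it reads from standard input, and it tightens the filter slightly so that empty lines aren't counted as words.

```shell
# top_words: hypothetical helper (not from the original pipeline).
# Reads text on stdin and prints the N most frequent words (default 10).
top_words() {
  tr -s '[:space:]' '\n' |          # one word per line
    tr '[:upper:]' '[:lower:]' |    # fold everything to lowercase
    tr -d '[:punct:]' |             # strip punctuation
    grep -E '^[a-z]+$' |            # keep only non-empty, all-letter tokens
    sort | uniq -c | sort -rn |     # count, most frequent first
    head -n "${1:-10}"
}
```

With the three books in the current directory, `cat *.txt | top_words 10` should produce a list along the lines of the one described above.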

Now, let's go in a different direction and analyze letter frequency. Then, we'll go back to finding interesting and unusual words.

## Calculating Letter Frequency

The question underlying calculating letter frequency is this: “how do you break down a word into individual letters so that you have one letter per line?” It turns out that the handy Linux tool fold can do exactly what we want:

```
$ echo hello | fold -w1
h
e
l
l
o
```

Neatly done! (Note that you can't use fmt or similar commands because even if you specify -w1 for width, it works with words, not characters.)

It's an easy leap from there to make fold break down every single word in a text file, sort the results, and use our power duo of uniq -c | sort -rn to get the results we seek:

```
$ fold -w1 < dracula.txt | sort | \
uniq -c | sort -rn | head
157559
78409 e
56524 t
51608 a
50568 o
43453 n
41749 h
38150 s
37950 i
35001 r

```

The most common character is a blank, which shows up as the count with nothing beside it, but we can skip that visually rather than complicate our pipe with yet another process.
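If you'd prefer to filter the blank out after all, one possible approach is to keep only the lines that actually hold a letter. This is a sketch under a couple of assumptions: the helper name `one_letter_per_line` is made up for illustration, and `LC_ALL=C` is set on `grep` so that only plain ASCII letters count as letters.

```shell
# one_letter_per_line: hypothetical helper name, not from the column.
# Splits stdin into one character per line, then keeps only ASCII letters,
# so blanks, digits, punctuation and empty lines all drop out.
one_letter_per_line() {
  fold -w1 | LC_ALL=C grep '[[:alpha:]]'
}

# Small demonstration on a phrase instead of a whole book.
printf 'a man a plan\n' | one_letter_per_line | sort | uniq -c | sort -rn
```

The cost is one more process in the pipe, which is exactly the trade-off mentioned above.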

As I said in the beginning, E is the most common letter, but it's a surprise to see T as the second most common, frankly. Maybe it's because we're not compensating for upper-/lowercase? Let's try again:

```
$ fold -w1 < dracula.txt | sort | \
tr '[:lower:]' '[:upper:]' | uniq -c | \
sort -rn | head -n 5
157559
78409 E
56524 T
51608 A
50568 O

```

Wait a minute. We shouldn't get the same result! Hmmm...can you see what I've done wrong? Hint: look at the order of commands in the pipe.

Got it? The tr needs to appear before the first sort command. Otherwise, it changes the case only after the uppercase and lowercase letters have already been sorted into separate groups, so uniq -c still counts them separately. We also should strip out punctuation, which can be done with the tr command as well. Here's a better attempt:

```
$ fold -w1 < dracula.txt | \
tr '[:lower:]' '[:upper:]' | sort | \
tr -d '[:punct:]' | uniq -c | \
sort -rn | head
157559
79011 E
58618 T
53146 A
51122 O
43975 N
43501 H
43423 I
39296 S
35607 R

```
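To see why the position of tr matters, remember that uniq collapses only adjacent identical lines. Here's a tiny experiment on four letters. The helper names are hypothetical, and `LC_ALL=C` is my assumption to force byte-order sorting, where "A" and "a" land far apart; other locales may interleave them and hide the problem.

```shell
# Hypothetical helpers; both read one letter per line on stdin.

# Wrong order: case is folded after sorting, so identical letters
# are not guaranteed to be adjacent when uniq -c counts them.
count_wrong() { LC_ALL=C sort | tr '[:lower:]' '[:upper:]' | uniq -c; }

# Right order: fold case first, then sort, so each letter forms one run.
count_right() { tr '[:lower:]' '[:upper:]' | LC_ALL=C sort | uniq -c; }

printf 'b\nA\na\nB\n' | count_wrong   # four lines, each with a count of 1
printf 'b\nA\na\nB\n' | count_right   # two lines: A and B, each with a count of 2
```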

Will this ordering change if we use all three of our books rather than just Dracula? Let's try it:

```
$ cat *.txt | fold -w1 | \
tr '[:lower:]' '[:upper:]' | sort | \
tr -d '[:punct:]' | uniq -c | \
sort -rn | head
468727
273409 E
201726 T
175637 A
169836 O
158561 N
155910 I
135513 S
133927 R
127716 H
```

Nearly the same result! The top five are unchanged, although H slips from sixth place to ninth. In order of frequency, the letters appear in text in the following sequence: E T A O N I S R H D L C U M F W G P Y B V K X J Q Z. (I'm a bit surprised that J shows up so infrequently.)

You now know what order to guess letters in Hangman, if nothing else.
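For a quick cheat sheet, the whole calculation could be rolled into one function that prints the letters on a single line, most frequent first. Consider this a sketch: `letter_order` is a name I made up, it assumes plain ASCII text on standard input, and it runs in the C locale so only A through Z are treated as letters.

```shell
# letter_order: hypothetical helper, reads text on stdin and prints
# the letters it contains on one line, most frequent first.
letter_order() (
  LC_ALL=C; export LC_ALL            # subshell body, so the caller's locale is untouched
  tr '[:lower:]' '[:upper:]' |       # fold case before sorting
    fold -w1 |                       # one character per line
    grep '[[:alpha:]]' |             # keep letters only
    sort | uniq -c | sort -rn |      # count, most frequent first
    awk '{ printf "%s%s", sep, $2; sep = " " } END { print "" }'
)

printf 'aaa bb c\n' | letter_order   # prints: A B C
```

Run as `cat *.txt | letter_order`, it should print a sequence like the one above. Letters with identical counts may come out in a different relative order depending on your sort implementation.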

## Speaking of Hangman

Before we wrap this up, let's go back through the words in our corpus and find just those that are at least ten letters long and occur infrequently. Here's how I'll do that:

```
$ cat *.txt | tr ' ' '\012' | \
tr '[:upper:]' '[:lower:]' | \
tr -d '[:punct:]' | tr -d '[0-9]' | \
sort | uniq -c | sort -n | \
1 abolitionists
1 accommodation
1 accommodations
1 accomplishing
1 accomplishments
1 accountability
1 achievements
1 acknowledging
1 acknowledgments
1 acquaintanceship