Analyze Song Lyrics with a Shell Script, Part II

In my last article, I began exploring song lyrics. Not so you could have an epic karaoke night, but in the sense of analyzing lyrics and the word usage therein. What sparked my curiosity was an article claiming that the prolific songwriting duo of Paul McCartney and John Lennon used the word "love" 160 times in Beatles songs.

How do you test that assertion? You do it by pulling the lyrics from a Web site that specializes in song lyrics—in this case MLDb—and analyzing them with a shell script.

I wrote the first part in my last article, which was a script that extracted links for every published song lyric attributed to The Beatles, stepping through the every-30 pagination structure of the site. In total, the site lists 240 songs by the band. Out of 240 songs, they mentioned "love" only 160 times? I'm skeptical.

In this article, I expand on the idea by downloading the lyrics to each and every one of those songs, then using some basic command-line tools to analyze word usage and frequency.

Tell Me What You See

The output of the script from my last article is a set of files that have the following contents:


<a href="song-32476-i-am-the-walrus.html">I Am The Walrus
<a href="song-32520-come-together.html">Come Together
<a href="song-32461-yellow-submarine.html">Yellow Submarine
<a href="song-32585-day-tripper.html">Day Tripper
<a href="song-32557-let-it-be.html">Let It Be

Prefix the site domain to make it a fully qualified URL, and each song page address looks like this: http://www.mldb.org/song-32520-come-together.html.

Let's go back into the source code and see how the lines are being extracted, because stitching together a better URL and saving its output as a song lyric source file should be easy, right?

Here's the line in question:


curl -s "$url&from=$start" | sed 's/</\
</g' | grep 'href="song-' > $output$start

Instead of just writing it to the output file, however, what if I built a proper URL and handed it to a subroutine that could use that to extract lyrics? Sounds easy, but keep in mind that the above produces a list of 30 songs, not a single song match.

In fact, the easiest solution is to change the code to stick with the output file, but make it a temp file, as it's just for internal use. Then I can step through the file line by line as desired.
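One way to set up such a temp file is sketched below. It assumes mktemp is installed (it is on most Linux and macOS systems, though it isn't part of POSIX). The helper name process_links and the sample link line are made up for illustration, and standard input stands in for the curl pipeline so the example runs without a network connection.

```shell
#!/bin/sh
# Hypothetical helper: the body runs in a subshell (note the parentheses),
# so the EXIT trap fires and removes the temp file when the function returns.
process_links() (
  tempfile=$(mktemp "${TMPDIR:-/tmp}/songlinks.XXXXXX") || exit 1
  trap 'rm -f "$tempfile"' EXIT

  # Stand-in for the curl | sed | grep pipeline: save stdin to the temp file.
  cat > "$tempfile"

  # Step through the saved lines one at a time.
  while IFS= read -r lineofdata
  do
    printf 'got: %s\n' "$lineofdata"
  done < "$tempfile"
)

printf '%s\n' '<a href="song-32520-come-together.html">Come Together' |
  process_links
```

Letting mktemp pick the name avoids collisions if two copies of the script run at once, and the trap means there is no stray file to clean up by hand.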

First, the simple change in the curl statement:


curl -s "$url&from=$start" | sed 's/</\
</g' | grep 'href="song-' > $tempfile

Next, here's code that can go through the output file, making line-by-line calls to a shell script function:


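while IFS= read -r lineofdata
do
  # the href is the second double-quote-delimited field;
  # the song number is the second dash-delimited piece of the href
  songnum=$(echo "$lineofdata" | cut -d\" -f2 | cut -d- -f2)
  fullurl="http://www.mldb.org/$(echo "$lineofdata" | \
     cut -d\" -f2)"
  savelyrics "$songnum" "$fullurl"
done < "$tempfile"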

Why am I saving the song number separately? Because it makes for an easy file output name, as I want to save the lyrics to each and every one of the matching songs. Yes, I could put them in one massive file, but somehow that doesn't seem right.
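To see what those two cut invocations pull out of a single link line, here is a small stand-alone sketch. The function name parse_link is hypothetical and not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper: print "songnum fullurl" for one link line.
parse_link() {
  # Field 2, splitting on double quotes, is the href value.
  href=$(printf '%s\n' "$1" | cut -d\" -f2)
  # Field 2 of the href, splitting on dashes, is the song number.
  songnum=$(printf '%s\n' "$href" | cut -d- -f2)
  printf '%s %s\n' "$songnum" "http://www.mldb.org/$href"
}

parse_link '<a href="song-32520-come-together.html">Come Together'
# prints: 32520 http://www.mldb.org/song-32520-come-together.html
```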

The work is all done by the savelyrics function, and here's how I've written it, having spent some time fine-tuning the filtering and transformation:


savelyrics()
{
   # extract just the lyrics and save them
   songnum="$1"
   fullurl="$2"

   # 1) keep the songtext-to-</table> region, 2) split on ">",
   # 3) keep lines ending in <br / or </p, 4) strip those remnants
   curl -s "$fullurl" | sed -n '/songtext/,/\/table/p' | \
     sed 's/>/\
/g' | grep -E "(<br|</p)" | \
     sed 's/<br \///g;s/<\/p//g' | uniq > "$output$songnum.txt"

   return 0
}

The curl statement gets the web page with the full song lyrics, which are roughly delineated by a CSS class ID of songtext and are contained in a crude HTML table, so the last line of the lyric appears prior to the table closing: </table>.

As I've mentioned before, sed is your friend when you want to extract well-delineated passages of text. Use sed -n to stop its usual behavior of echoing everything seen and /start/,/end/p to print just the lines from the first pattern match through the second, inclusive.
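Here is that range idiom on its own, run against a few made-up lines rather than a real MLDb page. The function name extract_songtext and the sample markup are illustrative assumptions.

```shell
#!/bin/sh
# Print only the lines from the first one matching "songtext"
# through the next one matching "/table", inclusive.
extract_songtext() {
  sed -n '/songtext/,/\/table/p'
}

printf '%s\n' \
  '<html>' \
  '<p class="songtext">Try to see it my way<br />' \
  'While you see it your way</p>' \
  '</table>' \
  '<div>footer</div>' | extract_songtext
```

The first and last sample lines are dropped; the three in the middle are printed.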

The problem is that even when you convert every closing angle bracket into a newline (to break the source file into a ton of separate lines for further processing), it's still a bit messy. Almost all lyric lines end with the sequence <br />, but the very last line of the lyrics has a </p> instead.

To catch both those lines and screen out everything else, grep has the handy -E flag, which enables extended regular expressions. Regular expressions are a world unto themselves (which I've delved into in prior columns), but suffice it to say a pattern of the form (A|B) matches lines that contain either pattern A or pattern B, exactly as you'd hope.
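A stand-alone sketch of the split-then-filter step follows. It uses tr to do the splitting instead of the sed substitution, simply because that is easier to type portably. The function name keep_lyric_lines and the one-line sample are invented for the demonstration.

```shell
#!/bin/sh
# Break the input at every ">" and keep only the pieces that
# contain "<br" or "</p", which is where the lyric text lives.
keep_lyric_lines() {
  tr '>' '\n' | grep -E '(<br|</p)'
}

printf '%s\n' \
  '<p class="songtext">Try to see it my way<br />While you see it your way</p></td>' |
  keep_lyric_lines
```

The two surviving lines still carry their tag fragments (<br / and </p), which is why one more cleanup pass is needed.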

That's really all the work. The third sed in the pipe simply removes the fragmentary remnant HTML code:


sed 's/<br \///g;s/<\/p//g'

(Remember, the format is s/old/new/g for a global substitution. This just looks more complex because "/" is part of the source pattern. The ";" lets you put two sed command sequences on the same line for convenience.)
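If the escaped slashes make your eyes cross, sed accepts other delimiters for the s command. A minimal sketch of the same cleanup written with "|" as the delimiter (the function name strip_tag_remnants is hypothetical):

```shell
#!/bin/sh
# Same two substitutions, but with "|" as the s-command delimiter,
# so the slashes in the patterns need no escaping.
strip_tag_remnants() {
  sed 's|<br /||g;s|</p||g'
}

printf '%s\n' 'Try to see it my way<br /' 'We can work it out</p' |
  strip_tag_remnants
```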

Do a quick uniq to collapse runs of blank lines (and, as a side effect, any lyric line repeated back to back), and you're done, ready to save. A sample song lyric output:


$ head lyrics.32586.txt
Try to see it my way
Do I have to keep on talking till I can't go on
While you see it your way
Run the risk of knowing that our love may soon be gone
We can work it out, we can work it out

Think of what you're saying
You can get it wrong and still you think that it's alright
Think of what I'm saying

Know the song? Hear it in your head now? I can definitely keep going with the rest of the lyrics if we're switching to karaoke at this point.
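Back to the pipeline for a moment: the closing uniq collapses any run of identical adjacent lines, not just blank ones, so a lyric line sung twice in a row is saved only once. If that matters for your counts, one alternative is to squeeze only the blank lines. The awk one-liner below is a sketch of that idea, not what the script above does, and squeeze_blank is a made-up name.

```shell
#!/bin/sh
# Print every line except a blank line that directly follows another blank.
squeeze_blank() {
  awk '!(/^$/ && blank) { print } { blank = /^$/ }'
}

# uniq would reduce this input to: Yeah, (blank), Yeah
# squeeze_blank keeps both leading "Yeah" lines.
printf '%s\n' 'Yeah' 'Yeah' '' '' 'Yeah' | squeeze_blank
```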

Try to See It My Way

I made one more tweak to the script so that its status output would be more informative as it runs. This now appears just before the call to savelyrics:


echo "$lineofdata ($songnum)" | cut -d\> -f2

And so, when run, the script has this sort of output:


$ sh getsongs.sh
I Am The Walrus (32476)
Across The Universe (32554)
Come Together (32520)
Yellow Submarine (32461)
Day Tripper (32585)
. . .
Maggie Mae (61310)
Back In The USSR (61300)
When I'm Sixty-Four (61299)
Good Morning Good Morning (61286)
Got To Get You Into My Life (61285)

Looks good. Here's a quick double-check:


$ ls lyrics.* | wc -l
     240

Got all 240 songs, so let's do some analysis. First off, how many songs have the word "love" in their title? With the new improved script output, that's easy:


$ sh getsongs.sh | grep -i love  | wc -l
      13

Looking across all the songs, how many lyric lines have the word "love"?


$ cat lyrics.* | grep -i love  | wc -l
     445

That's a whole lot more than 160! But, what about lines that have the word love more than once? They'd be counted only once. In fact, a more traditional word analysis could be fun and interesting. Let's start with just a single song, however, the cheerily titled "I'm A Loser":


$ cat lyrics.61278.txt  | tr ' ' '\
' | tr '[:upper:]' '[:lower:]' | sort | \
  uniq -c | sort -rn | head
  17 i
  13 a
  12 i'm
   9 and
   8 to
   8
   7 loser
   6 have
   5 what
   4 not

Notice that the first tr translates all spaces to newlines, the second ensures everything's in lowercase (using POSIX character-class notation for portability), then I simply sort all the words, use uniq -c to generate counts, then reverse sort by numeric count and examine the top ten matches. "I" is the most common word in this song lyric, followed by "a". Not surprising. (The bare 8 with no word beside it counts empty tokens: blank lines and doubled spaces.) Notice that "loser" shows up only seven times in the song (all in the reprise, actually).
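The same pipeline can be wrapped in a function and tightened a little. The sketch below adds two steps that are my own assumptions rather than part of the original command: it trims punctuation from the ends of each word (so "love," and "love" are counted together) and drops empty tokens. The name wordfreq is hypothetical.

```shell
#!/bin/sh
# Read lyrics on stdin, print "count word" pairs, most frequent first.
wordfreq() {
  tr -s ' ' '\n' |                                  # one word per line
    tr '[:upper:]' '[:lower:]' |                    # fold to lowercase
    sed 's/^[[:punct:]]*//;s/[[:punct:]]*$//' |     # trim edge punctuation
    grep -v '^$' |                                  # drop empty tokens
    sort | uniq -c | sort -rn
}

printf '%s\n' 'Love, love me do' 'You know I love you' | wordfreq | head -n 2
```

On the two sample lines, "love" comes out on top with a count of 3 and "you" follows with 2. Apostrophes inside a word, as in "i'm", are left alone because only leading and trailing punctuation is trimmed.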

And, what about if I examine every single song lyric en masse? Here's a surprisingly similar command-line invocation:


$ cat lyrics.*.txt  | tr ' ' '\
' | tr '[:upper:]' '[:lower:]' | sort | \
  uniq -c | sort -rn | head
5990
1728 you
1475 i
1060 the
 862 to
 781 me
 769 and
 765 a
 438 in
 432 my

These are all what are generally considered "noise words" in semantic analysis, so let's expand the head to include more matches and I'll hand-edit this final result for your reading pleasure:


1728 you
781 me
399 love
366 know
250 she
205 her

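There are lots more, but now there's an answer, ladies and gentlemen! I now can say that the word love, counted as a standalone space-delimited word, occurs 399 times in The Beatles songs (variants with punctuation attached, such as "love," are tallied as separate words) and appears in 13 of the group's song titles too (as revealed earlier).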
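As a cross-check on a count like this, grep can report every occurrence rather than every matching line. The sketch below relies on the -o and -w flags, which both GNU and BSD grep support but POSIX does not require; count_word is a hypothetical name, and the three sample lines stand in for the real lyric files.

```shell
#!/bin/sh
# Count whole-word, case-insensitive occurrences of $1 on stdin.
# -o prints each match on its own line, so wc -l counts occurrences.
count_word() {
  grep -o -i -w "$1" | wc -l | tr -d ' '
}

printf '%s\n' 'Love, love me do' 'You know I love you' 'Lovely Rita' |
  count_word love
# prints: 3
```

"Love," counts because the comma isn't part of the word, while "Lovely" doesn't because -w insists on a whole-word match. Against the downloaded files, the equivalent would be cat lyrics.*.txt | count_word love.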

Hello Goodbye

It took a while to get to the solution, but this analysis is a splendid example of what programmers call divide and conquer. Take a big problem and keep breaking it down into smaller and smaller parts until you can start to understand how to solve the little pieces. Then build it all back up so you can solve the big challenge.

Now, what about The Monkees? How often did they actually reference monkeys in their song lyrics? Hmm....

Dave Taylor has been hacking shell scripts on UNIX and Linux systems for a really long time. He's the author of Learning Unix for Mac OS X and Wicked Cool Shell Scripts. You can find him on Twitter as @DaveTaylor, and you can reach him through his tech Q&A site: Ask Dave Taylor.
