Analyzing Song Lyrics

I was reading about the history of The Beatles a few days ago and bumped into an interesting fact. According to the author, The Beatles used the word "love" in their songs more than 160 times. At first I thought, "cool", but the more I thought about it, the more I became skeptical about the figure. In fact, I suspect that the word "love" shows up considerably more than 160 times.

And, this leads to the question: how do you actually figure out something like that? The answer, of course, is with a shell script! So let's jump in, shall we?

Download Lyrics by Artist

The first challenge, and really most of the work, is figuring out where to download the lyrics for every song by an artist, performer or band. There are lots of online archives, but are they complete?

One source is MLDb, the Music Lyrics Database (modeled after the Internet Movie Database, one presumes). An easy test is this: how many songs does the site list for The Beatles?

Working backward from an interactive session in a web browser, an artist search for "the beatles" produces eight pages of matches, 30 matches per page. That's 240 songs. Wikipedia says that there are 237 original compositions for the band, and BeatlesBible.com shows 302 original songs. Confusing!

Of course, some of the songs recorded by The Beatles didn't have lyrics. For example, on the Magical Mystery Tour album, there's a track called "Flying". Given that Paul McCartney and John Lennon were such brilliant lyricists, however, the vast, vast majority of songs recorded have at least some lyrics—even "The End".

So let's go with MLDb and trust that its 240 songs are close enough for this task. Now the challenge is to get a list of all the songs, and then to download the lyrics for each and every song that matches.

Fortunately, that can be done by reverse-engineering the search URLs. The second page of results for an exact-phrase artist search for "the beatles" sorted by rating produces this particular URL: http://www.mldb.org/search?mq=the+beatles&mm=2&si=1&ob=2&from=30.

You can experimentally verify that this produces the second page of results, but hey, let's just run with it! Since the second page has a "from=30", you can conclude that there are 30 entries per page (as mentioned earlier) and that from=60 gets page three, from=90 page four, and so on.

Each page can be downloaded in HTML form using GET or curl, with my preference being to use the latter—it's more sophisticated and has oodles of options. A quick glance shows that "Yellow Submarine" shows up on the first page, so here's a quick test, with url set to the value shown above:


$ curl -s "$url" | grep "Yellow Submarine"
<table id="thelist"
cellspacing="0"><tr><th>Artist(s)</th><th>Song</th>
<th width="20">Rating</th></tr><tr class="h"><td
class="fa"><a
href='artist-39-the-beatles.html'>The Beatles</a></td><td
class="ft"><a
href="song-32476-i-am-the-walrus.html">I Am The Walrus</a></td><td
align="right">6</td></tr><tr class="n"><td class="fa"><a
href='artist-39-the-beatles.html'>The Beatles</a></td><td
class="ft"><a
href="song-32461-yellow-submarine.html">Yellow Submarine</a></td><td
align="right">6</td></tr><tr class="h"><td class="fa"><a
href='artist-39-the-beatles.html'>The Beatles</a></td><td
class="ft"><a
href="song-32585-day-tripper.h...

It turns out that the entire table of lyrics is a single line of HTML. That's a drag, but easily managed. Notice above that the href link to the individual song is of the form:


<a href="song-32461-yellow-submarine.html">Yellow Submarine</a>

That's the pattern I'm going to seek out in the raw HTML, noting that the links to the artist have a single quote, but the links to the lyrics are using double quotes:


curl -s "$url" | grep "Yellow Submarine" | sed 's/</\
</g' | grep 'href="song-'

Notice the sed pattern above. I'm replacing every < with a carriage return followed by the < so that the net effect is that I unwrap the HTML source neatly and then can use grep to isolate the matching lines and exclude everything else.

______________________

Dave Taylor has been hacking shell scripts for over thirty years. Really. He's the author of the popular "Wicked Cool Shell Scripts" and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.