Analyzing Song Lyrics

That line alone gets the following:


<a href="song-32476-i-am-the-walrus.html">I Am The Walrus
<a href="song-32461-yellow-submarine.html">Yellow Submarine
<a href="song-32585-day-tripper.html">Day Tripper
<a href="song-32520-come-together.html">Come Together
. . . lots of lines removed for clarity . . .
<a href="song-32395-a-hard-day-s-night.html">A Hard Day's Night
<a href="song-32571-i-want-to-hold-your-hand.html">I Want To Hold Your Hand
<a href="song-32527-here-comes-the-sun.html">Here Comes The Sun
<a href="song-32609-i-saw-her-standing-there.html">I Saw Her Standing There

Nice. Now how about turning each into a curl page query? Well, hold on! Let's first figure out how to get the full list of every song, that is, how to go from page to page. The URL already shown holds the clue: add from=XX to it for each subsequent page.

Another quick test shows what happens if you specify a URL that is beyond the last song listed: no matches are returned. That's easy to deal with because wc -l will return a zero in that instance.
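To see that zero test in isolation, here's a minimal sketch that never touches the network. The sample page text and the variable names are made up for illustration; only the grep pattern comes from the pipeline used in this column.

```shell
# Made-up stand-in for a results page that lists no songs
page='<html><body>No results</body></html>'

# Same filter as the real pipeline: keep song links, then count the lines
matches=$(printf '%s\n' "$page" | grep 'href="song-' | wc -l)

# Some versions of wc pad the count with spaces; arithmetic tidies that up
matches=$(( $matches + 0 ))

if [ "$matches" -eq 0 ] ; then
  echo "no matches, time to stop"
fi
```

Because grep prints nothing when the pattern isn't found, wc -l has no lines to count and reports zero.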

Put the pieces together, and here's a loop that will get as many matches as possible until there's a zero result:


url="http://www.mldb.org/search?mq=the+beatles&mm=2&si=1&ob=2"
output="lyrics-page." # you can put these in /tmp
start=0   # increment by 30, first page starts at zero
max=600   # artificial stop after 20 pages of matches

while [ $start -lt $max ]
do
  curl -s "$url&from=$start" | sed 's/</\
</g' | grep 'href="song-' > "$output$start"
  if [ $(wc -l < "$output$start") -eq 0 ] ; then
    # zero results page. let's stop, but let's remove it first
    echo "hit a zero results page with start = $start"
    rm "$output$start"
    break
  fi
  start=$(( $start + 30 ))      # increment by 30
done

I'll explain what's going on in the code momentarily, but let's just see what it does and then use an ls invocation to double-check it created non-zero output files:


$ sh getsongs.sh
hit a zero results page with start = 240
$ ls -s lyrics-page*
8 lyrics-page.0      8 lyrics-page.180    8 lyrics-page.60
8 lyrics-page.120    8 lyrics-page.210    8 lyrics-page.90
8 lyrics-page.150    8 lyrics-page.30

Perfect. I expected eight pages of songs, and that's what the script produced. Each has the same format as the output listed earlier, so it's now a matter of converting the href= format into an invocation to curl to get that particular page of lyrics. Since I'm already running out of space, however, I'm going to defer that part of the script until my next article.
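As a tiny preview, and only a sketch, here's one way the href= values could be pulled out of those saved lines. The two sample lines are copied from the output listed earlier. The base variable is an assumption: it guesses that the song links are relative to the site root, which the search URL suggests but which I haven't verified here. Nothing is fetched; the snippet just prints the URLs that curl would be given.

```shell
# Two lines in the same format the script saves, copied from the earlier output
sample='<a href="song-32476-i-am-the-walrus.html">I Am The Walrus
<a href="song-32461-yellow-submarine.html">Yellow Submarine'

# Assumption: song links are relative to the site root
base="http://www.mldb.org"

# Keep only the href value, then glue the base URL in front of each one
urls=$(printf '%s\n' "$sample" |
  sed -n 's/.*href="\(song-[^"]*\)".*/\1/p' |
  while read -r page ; do
    echo "$base/$page"
  done)

echo "$urls"
```

Each line of that output could then be handed to curl -s, one song per request.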

Meanwhile, notice how start is incremented by 30 with the $(( )) notation for calculations (you could use expr, but expr is a separate program, so it's faster to stay in the shell and not spawn another process for the math). Also, the test to identify an empty output file should be easy for you to understand:


if [ $(wc -l < "$output$start") -eq 0 ]

There is a nuance to catch here, however: the $( ) notation runs a command in a sub-shell and substitutes its output, akin to using backticks, while the $(( )) notation allows you to do rudimentary integer calculations within the shell itself.
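Here's a short side-by-side sketch of the notations. The variable names next and lines are just for illustration and aren't part of the script above.

```shell
start=0

# Arithmetic expansion: the shell does the math itself
start=$(( $start + 30 ))

# Same kind of sum with expr, which is a separate program the shell must run
next=$(expr "$start" + 30)

# Command substitution: run a pipeline and capture what it prints
lines=$(printf 'a\nb\nc\n' | wc -l)

# The + 0 strips any padding that wc adds to its count
echo "start=$start next=$next lines=$(( $lines + 0 ))"
```

The first assignment never leaves the shell, while the other two each start at least one extra process, which is why $(( )) is the better fit inside a loop.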

I'll expand on all of this in my next article. See ya then!

______________________

Dave Taylor has been hacking shell scripts for over thirty years. Really. He's the author of the popular "Wicked Cool Shell Scripts" and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.