Work the Shell - Still Parsing the Twitter Stream

How do you keep track of which tweets you've already answered?

Last month, you'll hopefully remember, we took the big step of having our Twitter stream parsing program actually parse the incoming messages and strip out quotes and other HTML noise. I also republished the send-tweet script, which we'll use this month.

The biggest challenge we face with the tweet-parser is knowing which messages we've already answered and which are new since the last time the program was run. The solution? To go back and tweak the original script a bit. It turns out that each and every tweet has a unique ID value, as you can see here:


<id>2541771</id>

You'll recall that early in the script we have this grep command:


grep -E '(<screen_name>|<text>)' | \

Simple enough. We'll tweak it to include |<id> and grab that value too. Except, of course, it's not that simple. It turns out that two <id> strings show up in the XML data from Twitter: one that's the ID of the account sending the message, and another that's the ID of the message itself—both conveniently labeled the same. Ugh!

Timestamps and Tricky XML

I can kvetch and wish Twitter would fix its XML to have USERID or similar, but what's the point? It has the same problem with the overloaded <created_at> tag too, so we're going to have to bite the bullet and accept that we are now grabbing four data fields from the XML feed, only three of which we care about.

Once we know that we're going to have four lines of output, cyclically, we simply can decide which of those are actually important and tweak them in the awk statement:


$curl -u "davetaylor:$pw" $inurl | \
  grep -E '(<screen_name>|<text>|<id>)' | \
  sed 's/@DaveTaylor //;s/  <text>//;s/<\/text>//' | \
  sed 's/ *<screen_name>//;s/<\/screen_name>//' | \
  sed 's/ *<id>//;s/<\/id>//' | \
  awk '{ if (NR % 4 == 0) {
           printf ("name=%s; ", $0) }
         else if (NR % 4 == 1) {
           printf("id=%s; ",$0) }
         else if (NR % 4 == 2) {
           print "msg=\"" $0 "\"" }
       }' > $temp

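That's a pretty complicated sequence, so let's look at the awk conditional statement a little closer. We have four input records (lines) per tweet that we're stepping through. The value of NR is the number of records read so far, starting at 1. So NR mod 4 equals 1 on the first line of each group of four, 2 on the second, 3 on the third and 0 on the fourth. The script prints the name value when the remainder is 0, the id when it's 1 and the message when it's 2. There's no branch for a remainder of 3, so that line, the extra <id> we don't care about, is dropped silently.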
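If the modulus arithmetic is hard to picture, a tiny experiment helps. The sketch below feeds awk eight made-up lines (a1 through d2 are placeholders, not real Twitter data) and prints NR and NR mod 4 next to each one, so you can see which remainder lands on which position in a group of four:

```shell
# Eight placeholder lines standing in for two tweets' worth of grep output.
# The labels a1..d2 are invented for illustration only.
result=$(printf '%s\n' a1 b1 c1 d1 a2 b2 c2 d2 |
  awk '{ printf("NR=%d NR%%4=%d %s\n", NR, NR % 4, $0) }')

echo "$result"
```

The fourth line of every group is the one with a remainder of 0, which is worth keeping in mind when deciding which branch of the conditional handles which field.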

Did you see that two lines have printf, and the third uses a simpler print statement? Since we want each set of variables on a separate line, we use the print statement for the last field, because it automatically appends a newline to the output. Of course, the same effect could be achieved by putting a newline (\n) at the end of the format string passed to printf. Example output follows:
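To convince yourself that the two forms really are interchangeable, here's a quick side-by-side sketch. The variable names with_print and with_printf are just for this demonstration, and the sample message is borrowed from the output shown here:

```shell
# Same msg line built two ways: print adds the newline for us,
# printf needs an explicit \n in the format string.
with_print=$(echo "Rates?" | awk '{ print "msg=\"" $0 "\"" }')
with_printf=$(echo "Rates?" | awk '{ printf("msg=\"%s\"\n", $0) }')

echo "$with_print"
echo "$with_printf"
```

Which one to use is mostly a matter of taste. The printf form keeps all three branches looking alike, while print saves a bit of quoting.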

name=thattalldude; id=6507045947; msg="Rates?"
name=KateC; id=6507034680; msg="hours"
name=pbarbanes; id=6507033698; msg="thanks"
name=jodie_nodes; id=6507022063; msg=" $$?"
name=KateC; id=6507019757; msg="price"
name=tarahn; id=6507008559; msg="impact"
name=GaryH2UK; id=6507004771; msg="directions"

Again, we're going to hand these, line by line, to the eval statement to set the three variables: name, id and msg. Then, it's a simple parsing problem, comparing msg to the known queries we have. Basically, it's what we did last month, except this time, every single tweet also has a unique ID value associated with it.
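As a reminder of how that eval step can work, here's a minimal sketch of the loop. It's an illustration rather than the exact code from the script: the two sample lines are copied from the output shown earlier, the temp file is created on the spot, and the variable parsed exists only so the result is easy to inspect. Keep in mind that eval runs whatever is on the line as shell code, so a message containing something like $(...) would be executed, which is a real concern when the text comes from strangers on Twitter.

```shell
# Build a small stand-in for the $temp file the pipeline produces.
temp=$(mktemp)
cat > "$temp" <<'EOF'
name=KateC; id=6507034680; msg="hours"
name=tarahn; id=6507008559; msg="impact"
EOF

# Read one line at a time; eval turns it into three variable assignments.
parsed=$(while read -r line
do
  eval "$line"
  echo "$name $id $msg"
done < "$temp")

echo "$parsed"
rm -f "$temp"
```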

A typical test might now look like this:

if [ "$msg" == "hours" ] ; then
  echo "@$name asked what our hours are in tweet $id"
fi

Nice! It's simple, straightforward and well worth the preprocessing hoops we've jumped through.

Working with IDs Included

Indeed, I run that against my Twitter stream (after asking people to send me sample queries), and here's what I see:

@TheNose100 asked what our hours are in tweet 6507436100
@crepeauf asked what our hours are in tweet 6507187325
@jdscott asked what our hours are in tweet 6507087136
@KateC asked what our hours are in tweet 6507034680
@inspiremetoday asked what our hours are in tweet 6506966654

I bet you can see how to proceed from here. We write static responses, calculate values as needed and use send-tweet to respond to the user:

$tweet "@$name our hours are Mon-Fri 9-5, Sat 10-4."

For fun, I'll let people send the query “time” and get the current output of the date command too, just to demonstrate how that might work. Here's the code block:

if [ "$msg" == "time" ] ; then
  echo "@$name asked for the time in tweet $id"
  $tweet "@$name the local time on our server is $(date)"
fi
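As the list of known queries grows, a long chain of if statements gets unwieldy. One alternative, sketched here under the assumption that every reply is a single line of text, is to collect the queries in a case statement. The function name reply_for is hypothetical and not part of the original script, and the echo at the end stands in for the call to $tweet:

```shell
# reply_for is a hypothetical helper name, not part of the original script.
# It prints the reply text for a known query and returns non-zero otherwise.
reply_for() {
  case "$1" in
    hours) echo "our hours are Mon-Fri 9-5, Sat 10-4." ;;
    time)  echo "the local time on our server is $(date)" ;;
    *)     return 1 ;;
  esac
}

msg="hours"
name="KateC"
if reply=$(reply_for "$msg") ; then
  # In the real script this line would be: $tweet "@$name $reply"
  echo "@$name $reply"
fi
```

Unknown queries fall through to the catch-all pattern, so the script stays quiet instead of answering something it doesn't understand.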

Great. Got it all, except for where we started out. How do you track which tweets you've already answered?

______________________

Dave Taylor has been hacking shell scripts for over thirty years. Really. He's the author of the popular "Wicked Cool Shell Scripts" and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.
