Work the Shell - Still Parsing the Twitter Stream

How do you keep track of which tweets you've already answered?

Last month, you'll hopefully remember, we took the big step in our Twitter stream parsing program of actually having it parse the incoming messages and strip out quotes and other HTML noise. I also republished the send-tweet script, which we'll use this month.

The biggest challenge we face with the tweet-parser is knowing what messages we've already answered and which are new since the last time the program was run. The solution? To go back and tweak the original script a bit. It turns out that each and every tweet has a unique ID value, as you can see here:


<id>2541771</id>

You'll recall that early in the script we have this grep command:


grep -E '(<screen_name>|<text>)' | \

Simple enough. We'll tweak it to include |<id> and grab that value too. Except, of course, it's not that simple. It turns out that two <id> strings show up in the XML data from Twitter: one that's the ID of the account sending the message, and another that's the ID of the message itself—both conveniently labeled the same. Ugh!

Timestamps and Tricky XML

I can kvetch and wish Twitter would fix its XML to have USERID or similar, but what's the point? They have the same thing with the overloaded <created_at> tag too, so we're going to have to bite the bullet and accept that we are now grabbing four data fields from the XML feed, only three of which we care about.

Once we know that we're going to have four lines of output, cyclically, we simply can decide which of those are actually important and tweak them in the awk statement:


$curl -u "davetaylor:$pw" $inurl | \
  grep -E '(<screen_name>|<text>|<id>)' | \
  sed 's/@DaveTaylor //;s/  <text>//;s/<\/text>//' | \
  sed 's/ *<screen_name>//;s/<\/screen_name>//' | \
  sed 's/ *<id>//;s/<\/id>//' | \
  awk '{ if (NR % 4 == 0) {
           printf ("name=%s; ", $0) }
         else if (NR % 4 == 1) {
           printf("id=%s; ",$0) }
         else if (NR % 4 == 2) {
           print "msg=\"" $0 "\"" }
       }' > $temp

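That's a pretty complicated sequence, so let's look at the awk conditional statement a little closer. We have four input records (lines) that we're stepping through, over and over. The value of NR is the number of the current record, starting at 1. So NR mod 4 equals 0 on every fourth record (lines 4, 8, 12 and so on), which the script treats as the name value. Remainders of 1 and 2 are the id and msg values, and a remainder of 3 matches no branch at all, so that line, the ID we don't care about, is dropped.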

Did you see that two lines use printf, and the third uses a simpler print statement? Since we want each set of variables on a separate line, we use print for the last one, because it automatically appends a newline to the output. Of course, the same effect could be achieved by ending the format string passed to printf with \n. Example output follows:

name=thattalldude; id=6507045947; msg="Rates?"
name=KateC; id=6507034680; msg="hours"
name=pbarbanes; id=6507033698; msg="thanks"
name=jodie_nodes; id=6507022063; msg=" $$?"
name=KateC; id=6507019757; msg="price"
name=tarahn; id=6507008559; msg="impact"
name=GaryH2UK; id=6507004771; msg="directions"
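To watch the four-line cycle in isolation, here's a minimal sketch that runs the same NR % 4 logic over placeholder input. The function name cycle_demo and the line1 through line8 values are made up for illustration, not real Twitter data, and the msg branch uses printf with an explicit \n instead of print to show that the two are interchangeable:

```shell
# cycle_demo (hypothetical name): reads lines on stdin and applies
# the same NR % 4 tests as the awk statement in the main pipeline.
cycle_demo() {
  awk '{ if (NR % 4 == 0) {
           printf("name=%s; ", $0) }
         else if (NR % 4 == 1) {
           printf("id=%s; ", $0) }
         else if (NR % 4 == 2) {
           printf("msg=\"%s\"\n", $0) }
       }'
}

# Eight placeholder lines: two passes through the cycle.
printf '%s\n' line1 line2 line3 line4 line5 line6 line7 line8 | cycle_demo
```

With this input, the first output line is id=line1; msg="line2" and the second is name=line4; id=line5; msg="line6". Lines 3 and 7 vanish, and each name is printed at the start of the line that the following id and msg complete, so it's worth checking the order of the fields in your own feed against the remainders you test for.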

We're going to hand these again, line by line, to the eval statement to set the three variables: name, id and msg. Then, it's a simple parsing problem, comparing msg to the known queries we have. Basically, it's what we did last month, except this time, every single tweet also has a unique ID value associated with it.

A typical test might now look like this:

if [ "$msg" == "hours" ] ; then
  echo "@$name asked what our hours are in tweet $id"
fi

Nice! It's simple, straightforward and well worth the preprocessing hoops we've jumped through.
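Here's one way the eval step and the test might fit together, as a rough sketch rather than the finished script. The function name answer_hours is invented for this example, and the two input lines are copied from the sample output above. It uses a single = in the test, which behaves the same as == in bash and also works in plainer shells:

```shell
# answer_hours (hypothetical name): $1 is a file of lines such as
#   name=KateC; id=6507034680; msg="hours"
answer_hours() {
  while read -r line ; do
    eval "$line"            # sets name, id and msg from the line
    if [ "$msg" = "hours" ] ; then
      echo "@$name asked what our hours are in tweet $id"
    fi
  done < "$1"
}

temp=$(mktemp)
cat > "$temp" <<'EOF'
name=KateC; id=6507034680; msg="hours"
name=tarahn; id=6507008559; msg="impact"
EOF

answer_hours "$temp"
rm -f "$temp"
```

Be aware that eval runs each line as shell code, so a message containing a dollar sign, backquotes or a stray double quote gets expanded or can break the assignment. For anything public-facing, it's probably wise to sanitize the message text before it ever reaches eval.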

Working with IDs Included

Indeed, I run that against my Twitter stream (after asking people to send me sample queries), and here's what I see:

@TheNose100 asked what our hours are in tweet 6507436100
@crepeauf asked what our hours are in tweet 6507187325
@jdscott asked what our hours are in tweet 6507087136
@KateC asked what our hours are in tweet 6507034680
@inspiremetoday asked what our hours are in tweet 6506966654

I bet you can see how to proceed from here. We write static responses, calculate values as needed and use send-tweet to respond to the user:

$tweet "@$name our hours are Mon-Fri 9-5, Sat 10-4."

For fun, I'll let people send the query “time” and get the current output of the date command too, just to demonstrate how that might work. Here's the code block:

if [ "$msg" == "time" ] ; then
  echo "@$name asked for the time"
  $tweet "@$name the local time on our server is $(date)"
fi
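As the list of known queries grows, a chain of if statements could be folded into a single case statement. What follows is only a sketch of that idea: the respond function is hypothetical, and tweet is set to a harmless echo stand-in here, where the real script presumably points it at send-tweet:

```shell
# Stand-in so the sketch runs without posting anything.
tweet="echo WOULD-TWEET:"

# respond (hypothetical name): $1 = screen name, $2 = message text
respond() {
  case "$2" in
    hours) $tweet "@$1 our hours are Mon-Fri 9-5, Sat 10-4." ;;
    time)  $tweet "@$1 the local time on our server is $(date)" ;;
    *)     return 1 ;;    # unknown query: send nothing
  esac
}

respond KateC hours
```

The nonzero return for unknown queries gives the calling loop an easy way to log messages it didn't understand.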

Great. Got it all, except for where we started out. How do you track which tweets you've already answered?

______________________

Dave Taylor has been hacking shell scripts for over thirty years. Really. He's the author of the popular "Wicked Cool Shell Scripts" and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.
