At the Forge - Bloglines Web Services, Continued
I am writing this column a few days after the November 2, 2004, elections in the United States. As an admitted political junkie, I enjoy the modern era of computerized, always-on punditry. No longer must I switch TV stations or read several newspapers at the local library; now, I can follow the sound bites as they pass from the candidates to the press to the various partisan sites.
Keeping up with many different news and opinion sites can consume quite a bit of time. As we have seen over the last few months, everyone has benefited from the creation of news aggregators—programs that read the RSS and Atom syndication feeds produced by Weblogs, newspapers and other frequently updated sites. An aggregator, as its name suggests, takes these feeds and puts them into a single, easily accessible listing.
Bloglines.com is an Internet startup that provides a Web-based news aggregator. In and of itself, this should not surprise anyone; the combination of syndication, aggregation and the Web made this a natural idea. And, Bloglines isn't unique; there are other, perhaps lesser-known, Web-based news aggregators.
One unique service that Bloglines offers its subscribers, however, is the ability to use Bloglines' internal database to create their own news aggregators or their own applications built from the data Bloglines has collected. This information is available without charge, under a fairly unrestrictive license, to any programmer interested in harvesting the results of Bloglines' engine. The fact that Bloglines checks for updates on hundreds of thousands of blogs and sites approximately every hour means that someone using the Web services API can be assured of getting the most recent Weblog content.
Last time [LJ, January 2005], we looked at the Notifier API, which provides access to a particular user's available-but-unread feeds. We also discussed the Blogroll API, which allows users to determine and use programmatically, if they wish, a list of people who are pointing to a feed. As we saw, these APIs made it easy for us to find out that new Weblog entries were available or to create our own custom aggregation page listing Weblogs of interest.
Something was missing in the functionality that we exposed in that article, however. It's nice to know that new Weblog entries are among my Bloglines subscriptions, but it would be even nicer to know which blogs had been updated. And, it's nice to get a list of my current subscriptions, but I would be much happier to find out which of them have been updated—and to find out when they were most recently updated, how many new entries are in each Weblog and what those entries contain. In other words, I want to be able to replace the current Bloglines interface with one of my own, displaying new Weblog entries in a format that isn't dictated by the Bloglines.com Web site.
Luckily, the Web services developers at Bloglines have made it possible to do exactly this by way of the sync API. This month, we continue our exploration of Bloglines Web services, looking in detail at the sync API. We also are going to create a simple news aggregator of our own, one that provides some of the same features as the Bloglines interface.
At its core, a news aggregator such as Bloglines is simply a list of URLs. Indeed, the Python-based news aggregator we created two months ago using the Universal Feed Parser was precisely such a program: it looked at a set of URLs in a file and retrieved the most recent items associated with those URLs. Each individual Weblog posting must be associated with one of the URLs on the list. Removing a URL from a user's subscription list makes its associated postings irrelevant, and therefore invisible, to that user.
The fact that Bloglines has multiple users rather than a single user means it must keep track of not only a set of URLs, but also which URLs are associated with each user. Although this obviously complicates things somewhat, modern high-level languages make the difference between these two data structures easy to understand. Rather than simply storing a list of URLs, we must create a hash table, in which the key is a user ID and the value is the list of URLs associated with that particular user. Once we have the user's unique ID, we easily can keep track of that particular user's subscriptions.
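To make the hash-table idea concrete, here is a minimal sketch in Python. It is only an illustration of the data structure described above, not anything Bloglines is known to run; the class name SubscriptionTable and its method names are my own inventions.

```python
class SubscriptionTable:
    """Hypothetical in-memory model: user ID -> list of feed URLs."""

    def __init__(self):
        # The hash table: each key is a user ID, each value is that
        # user's own list of subscribed URLs.
        self._by_user = {}

    def subscribe(self, user_id, url):
        # Create the user's list on first use, then avoid duplicates.
        urls = self._by_user.setdefault(user_id, [])
        if url not in urls:
            urls.append(url)

    def unsubscribe(self, user_id, url):
        # Removing a URL hides its postings from this user only.
        urls = self._by_user.get(user_id, [])
        if url in urls:
            urls.remove(url)

    def subscriptions(self, user_id):
        # Return a copy so callers cannot modify the table by accident.
        return list(self._by_user.get(user_id, []))
```

With this structure, unsubscribing one user from a feed has no effect on any other user who happens to follow the same URL.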
Of course, Bloglines is keeping track of subscriptions not for a few thousand users, but for many tens or hundreds of thousands of users. Thus, it is safe to assume that Bloglines is not using such a naive implementation, which would suffice only for a small experiment or an aggregator designed for a small number of people. Things get a bit trickier when you approach Bloglines' user load. Each entry in a user's list of subscriptions probably isn't a bare URL; it is more likely to be an ID number (or primary key, in database jargon) associated with a URL. Such a system lets many users subscribe to the same stored copy of a site's syndication feed and allows Bloglines to suggest new Weblogs that users might enjoy, based on their current subscriptions.
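The following Python sketch shows one way such an ID-based design might look. It is a guess at the general shape, not a description of Bloglines' actual schema: the class Aggregator, its attribute names and the simple "count what overlapping users read" suggestion rule are all assumptions made for illustration.

```python
from collections import Counter


class Aggregator:
    """Hypothetical model in which users point at feed IDs, not URLs."""

    def __init__(self):
        self.feed_ids = {}    # URL -> feed ID (the "primary key")
        self.feed_urls = {}   # feed ID -> URL
        self.user_feeds = {}  # user ID -> set of feed IDs

    def feed_id_for(self, url):
        # Each URL is stored once; later subscribers reuse its ID.
        if url not in self.feed_ids:
            new_id = len(self.feed_ids) + 1
            self.feed_ids[url] = new_id
            self.feed_urls[new_id] = url
        return self.feed_ids[url]

    def subscribe(self, user_id, url):
        feed_id = self.feed_id_for(url)
        self.user_feeds.setdefault(user_id, set()).add(feed_id)
        return feed_id

    def suggest(self, user_id, limit=3):
        # Count feeds followed by users who share at least one
        # subscription with this user, skipping feeds already followed.
        mine = self.user_feeds.get(user_id, set())
        counts = Counter()
        for other, feeds in self.user_feeds.items():
            if other == user_id or not (mine & feeds):
                continue
            counts.update(feeds - mine)
        return [self.feed_urls[fid] for fid, _ in counts.most_common(limit)]
```

Because two users who subscribe to the same URL receive the same feed ID, comparing subscription lists becomes a cheap set operation, which is what makes a suggestion feature practical.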
It thus should come as no surprise to learn that retrieving new Weblog postings from Bloglines is a two-step process. First, we ask Bloglines for the list of subscription IDs associated with a user. Then, for each subscription ID on that list, we ask Bloglines to send us all of the new items for that user.
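The two steps can be sketched in Python against a stand-in service. Nothing here talks to Bloglines: FakeService, list_subscriptions and get_items are hypothetical names chosen for this illustration, and the data lives entirely in memory.

```python
class FakeService:
    """Hypothetical stand-in for a remote aggregator (not the real API)."""

    def __init__(self, subs, items):
        self._subs = subs    # user -> list of (subscription ID, title)
        self._items = items  # (user, subscription ID) -> list of new items

    def list_subscriptions(self, user):
        return list(self._subs.get(user, []))

    def get_items(self, user, sub_id):
        return list(self._items.get((user, sub_id), []))


def fetch_unread(service, user):
    """Step 1: list the user's subscriptions.
    Step 2: ask for the new items of each subscription ID."""
    unread = {}
    for sub_id, title in service.list_subscriptions(user):
        items = service.get_items(user, sub_id)
        if items:  # leave out feeds with nothing new
            unread[title] = items
    return unread
```

A caller would build a FakeService from two dictionaries and pass it to fetch_unread along with a user name; the result maps each feed title to its list of new items.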
Implementations of the Bloglines Web services API are available in several different languages. Because Perl is my default language for creating new applications, I am going to use the WebService::Bloglines module from CPAN, the Comprehensive Perl Archive Network, a worldwide collection of Web and FTP servers from which Perl and its modules can be retrieved. Listing 1 contains a simple program (bloglines-listsubs.pl) that uses this module to display the title, subscription ID and URL for each of a user's subscriptions. A number of additional values are available for each subscription; the documentation for WebService::Bloglines, as well as the Bloglines API documentation, lists these in detail.