At the Forge - Aggregating Syndication Feeds
The point of a news aggregator or other application that uses RSS and Atom is to collect and present newly updated information. An aggregator can show only the items that a server provides; if an RSS feed includes only the two most recently published items, then it becomes the aggregator's responsibility to poll, cache and display those items no longer being syndicated and summarized.
This raises two different but related questions: How can we ensure that our aggregator displays only items we have not seen yet? And is there a way for our aggregator to reduce the load on Weblog servers, retrieving only those items that were published since our last visit? Answering the first question requires looking at the modification date, if it exists, for each item.
The latter question has, as of this writing, become an increasingly popular topic of debate in the Web community. As a Weblog grows in popularity, the number of people who subscribe to its syndication feed also grows. If a Weblog has 500 subscribers to its syndication feed, and if each of these subscribers' aggregators looks for updates each hour, that means an additional 500 requests per hour are made against a Web server. If the syndication feed provides the site's entire content, this can result in a great deal of wasted bandwidth, reducing the site's response time for other visitors and potentially forcing the site owner to pay for exceeding the allocated monthly bandwidth.
feedparser allows us to be kind to syndicating servers and ourselves by providing a mechanism for retrieving a syndication feed only when there is something new to show. This is possible because modern versions of HTTP allow the requesting client to include an If-Modified-Since header, followed by a date. If the requested URL has changed since the date mentioned in the request, the server responds with the URL's content. But if the requested URL is unchanged, the server returns a 304 response code, indicating that the previously downloaded version remains the most current content.
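To make the exchange concrete, here is a minimal sketch of such a conditional request built with Python's standard urllib module, independent of feedparser. The function name build_conditional_request is an illustrative assumption, and the request is only constructed here, not sent.

```python
import calendar
import urllib.request
from email.utils import formatdate


def build_conditional_request(url, last_retrieval):
    """Return a Request carrying an If-Modified-Since header.

    last_retrieval is a nine-element time tuple in UTC, a hypothetical
    value that the aggregator would keep from its previous visit.
    """
    # Convert the UTC time tuple to seconds since the epoch.
    timestamp = calendar.timegm(last_retrieval)
    # HTTP dates look like "Wed, 01 Sep 2004 00:00:00 GMT".
    http_date = formatdate(timestamp, usegmt=True)
    request = urllib.request.Request(url)
    request.add_header("If-Modified-Since", http_date)
    return request


request = build_conditional_request(
    "http://www.linuxjournal.com/news.rss",
    (2004, 9, 1, 0, 0, 0, 0, 0, 0),
)
# urllib stores header names in capitalized form ("If-modified-since");
# HTTP header names are case-insensitive, so servers treat it the same.
print(request.get_header("If-modified-since"))
```

If such a request were sent with urllib.request.urlopen(), an unchanged feed would typically surface as an HTTPError whose code is 304. feedparser builds this header itself, so the sketch is only meant to show what travels over the wire.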
We accomplish this by passing an optional modified parameter to our call to feedparser.parse(). This parameter is a standard Python time tuple, as defined by the time module, in which the first six elements are the year, month number, day number, hour, minutes and seconds. The final three items don't concern us and can be left as zeroes. So if I were interested in seeing feeds posted since September 1, 2004, I could say:
    last_retrieval = (2004, 9, 1, 0, 0, 0, 0, 0, 0)
    ljfeed = feedparser.parse("http://www.linuxjournal.com/news.rss", modified=last_retrieval)
If Linux Journal's server is configured well, the above code results either in ljfeed containing the complete syndication feed, returned with an HTTP OK status (numeric code 200), or in an indication that the feed has not changed since its last retrieval (numeric code 304). Although keeping track of the last time you requested a particular syndication feed might require more record-keeping on your part, it is important to do, especially if you request feed updates on a regular basis. Otherwise, you might find your application unwelcome at certain sites.
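The record-keeping can be quite small. The following sketch, which assumes a hypothetical JSON file named retrieval_times.json and helper names of my own choosing, stores one time tuple per feed URL and checks a parsed result for the 304 code. It uses a plain dictionary in place of a real feedparser result so that it runs without the library installed.

```python
import json
import os
import tempfile


def load_retrieval_times(path):
    """Return a dict mapping feed URL to time tuple, or {} if no file yet."""
    if not os.path.exists(path):
        return {}
    with open(path) as handle:
        # JSON stores tuples as lists, so convert them back.
        return {url: tuple(stamp) for url, stamp in json.load(handle).items()}


def save_retrieval_times(path, times):
    """Write the URL-to-time-tuple mapping back to disk."""
    with open(path, "w") as handle:
        json.dump(times, handle)


def feed_changed(result):
    """Return True unless the result reports HTTP 304 (not modified)."""
    # feedparser results behave like dictionaries and carry a "status"
    # key for feeds fetched over HTTP. A result without one (a local
    # file, say) is treated as changed.
    return result.get("status", 200) != 304


# Demonstration in a scratch directory; the file name is an assumption.
path = os.path.join(tempfile.mkdtemp(), "retrieval_times.json")
times = load_retrieval_times(path)
url = "http://www.linuxjournal.com/news.rss"
# With feedparser installed, the request would look like:
#     result = feedparser.parse(url, modified=times.get(url))
result = {"status": 304}  # stand-in for a real parse result
if feed_changed(result):
    print("New content from", url)
else:
    print("Nothing new from", url)
times[url] = (2004, 9, 1, 0, 0, 0, 0, 0, 0)
save_retrieval_times(path, times)
```

In a real aggregator, the stored tuple would come from the time of the request, for example time.gmtime(), rather than from a fixed date.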
Now that we have a basic idea of how to work with feedparser, let's create a simple aggregation tool. This tool gets its input from a file called feeds.txt and produces its output in the form of an HTML file called myfeeds.html. Running this program from cron and looking at the resulting HTML file once per day provides a crude-but-working news feed from the sites that most interest you.
The file feeds.txt contains the URLs of actual feeds rather than of the sites from which we would like to get the feed. In other words, it's up to the user to find and enter the URL for each feed. More sophisticated aggregation tools usually are able to determine the feed's URL from a link tag in the header of the site's home page.
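For the curious, this kind of discovery might look roughly like the following sketch, which uses only Python's standard html.parser module. The class and function names are my own, the sample page is invented, and the sketch assumes the common convention of a link tag with rel="alternate" and an RSS or Atom type.

```python
from html.parser import HTMLParser

# MIME types commonly used to advertise RSS and Atom feeds.
FEED_TYPES = ("application/rss+xml", "application/atom+xml")


class FeedLinkFinder(HTMLParser):
    """Collect href values from link tags that advertise a feed."""

    def __init__(self):
        super().__init__()
        self.feed_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # Attribute values may be None when an attribute has no value.
        rel = (attributes.get("rel") or "").lower().split()
        link_type = (attributes.get("type") or "").lower()
        href = attributes.get("href")
        if "alternate" in rel and link_type in FEED_TYPES and href:
            self.feed_urls.append(href)


def find_feed_urls(html_text):
    """Return the feed URLs advertised in a page's link tags."""
    finder = FeedLinkFinder()
    finder.feed(html_text)
    return finder.feed_urls


# An invented home page, used only for illustration.
sample_page = """<html><head>
<title>Example Weblog</title>
<link rel="stylesheet" type="text/css" href="/style.css">
<link rel="alternate" type="application/rss+xml" href="http://example.com/index.rss">
</head><body></body></html>"""

print(find_feed_urls(sample_page))  # ['http://example.com/index.rss']
```

This sketch looks at every link tag in the page rather than only those in the header, and it returns href values as written, so a relative URL would still need to be resolved against the page's address.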
Also, despite my above warning that every news aggregator should keep track of its most recent request so as not to overwhelm servers, this program leaves out such features as part of my attempt to keep it small and readable.
The program, aggregator.py, can be read in Listing 1 and is divided into four parts:
We first open the output file, which is an HTML-formatted text file called myfeeds.html. The file is designed to be used from within a Web browser. If you are so inclined, you could add this local file, which has a file:/// URL, to your list of personal bookmarks or even make it your startup page. After making sure that we indeed can write to this file, we start the HTML file.
We then read the contents of feeds.txt, which contains one feed URL per line. In order to avoid problems with whitespace or blank lines, we strip off the whitespace and ignore any line without at least one printable character.
Next, we iterate over the list of feeds, feeds_list, invoking feedparser.parse() on each URL. When we receive a response, we write each article's URL and title to the output file, myfeeds.html.
Finally, we close the HTML and the file.
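The four parts described above can be sketched as follows. This is not a reproduction of Listing 1; the function names are my own, and a hand-made entry stands in for what feedparser.parse() would return so that the sketch runs without the library or a network connection.

```python
import html


def read_feed_urls(lines):
    """Strip whitespace and drop lines with no printable characters."""
    return [line.strip() for line in lines if line.strip()]


def render_page(feeds):
    """Build the HTML page from (feed_url, entries) pairs.

    Each entry is a mapping with "title" and "link" keys, the shape
    feedparser uses for the items in its entries list.
    """
    # Start the HTML file.
    parts = ["<html>", "<head><title>My feeds</title></head>", "<body>"]
    for feed_url, entries in feeds:
        parts.append("<h2>%s</h2>" % html.escape(feed_url))
        parts.append("<ul>")
        for entry in entries:
            # Escape text so titles containing & or < stay valid HTML.
            title = html.escape(entry.get("title", "(untitled)"))
            link = html.escape(entry.get("link", ""))
            parts.append('<li><a href="%s">%s</a></li>' % (link, title))
        parts.append("</ul>")
    # Close the HTML.
    parts.extend(["</body>", "</html>"])
    return "\n".join(parts)


# Lines as they might be read from feeds.txt, including blank ones.
urls = read_feed_urls(["http://www.linuxjournal.com/news.rss\n", "   \n", "\n"])

# With feedparser installed, the entries would come from
# feedparser.parse(url).entries; this invented entry stands in for them.
sample = [(urls[0], [{"title": "Tools & tips", "link": "http://example.com/1"}])]
page = render_page(sample)
print(page)
```

Writing the page variable to myfeeds.html with an ordinary open() call in write mode would complete the program.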