At the Forge - Aggregating Syndication Feeds

So far, we have looked at ways in which people might create RSS and Atom feeds for a Web site. Of course, creating syndication feeds is only half of the equation. Equally important, and perhaps even more useful, is understanding how we can retrieve and use syndication feeds, both from our own sites and from other sites of interest.
How New Is that News?

The point of a news aggregator or other application that uses RSS and Atom is to collect and present newly updated information. An aggregator can show only the items that a server provides; if an RSS feed includes only the two most recently published items, then it becomes the aggregator's responsibility to poll regularly and to cache items, so it can still display those that have since dropped out of the feed.

This raises two different but related questions: How can we ensure that our aggregator displays only items we have not seen yet? And is there a way for our aggregator to reduce the load on Weblog servers, retrieving only those items that were published since our last visit? Answering the first question requires looking at the modification date, if it exists, for each item.

The latter question has been, as of this writing, the subject of growing debate in the Web community. As a Weblog grows in popularity, the number of people who subscribe to its syndication feed grows as well. If a Weblog has 500 subscribers to its syndication feed, and each subscriber's aggregator looks for updates every hour, an additional 500 requests per hour are made against the Web server. If the syndication feed provides the site's entire content, this can waste a great deal of bandwidth, reducing the site's response time for other visitors and potentially forcing the site owner to pay for exceeding the allocated monthly bandwidth.

feedparser allows us to be kind to syndicating servers and ourselves by providing a mechanism for retrieving a syndication feed only when there is something new to show. This is possible because modern versions of HTTP allow the requesting client to include an If-Modified-Since header, followed by a date. If the requested URL has changed since the date mentioned in the request, the server responds with the URL's content. But if the requested URL is unchanged, the server returns a 304 response code, indicating that the previously downloaded version remains the most current content.
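
To make the exchange concrete, here is a minimal sketch of a conditional GET performed by hand with Python's standard library; feedparser does the equivalent for us behind the scenes, the URL and date shown are only placeholders, and older Pythons spell these modules urllib2 rather than urllib.request:

import urllib.request
import urllib.error

# Ask for the feed only if it has changed since September 1, 2004.
request = urllib.request.Request(
    "http://www.linuxjournal.com/news.rss",
    headers={"If-Modified-Since": "Wed, 01 Sep 2004 00:00:00 GMT"})

try:
    response = urllib.request.urlopen(request)
    print("200 OK: received %d bytes of feed data" % len(response.read()))
except urllib.error.HTTPError as err:
    if err.code == 304:
        print("304 Not Modified: our cached copy is still current")
    else:
        raise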

We accomplish this by passing an optional modified parameter to our call to feedparser.parse(). This parameter is a standard Python time tuple, as defined by the time module, in which the first six elements are the year, month, day, hour, minutes and seconds. The final three elements don't concern us here and can be left as zeroes. So if I were interested in seeing items posted since September 1, 2004, I could say:

last_retrieval = (2004, 9, 1, 0, 0, 0, 0, 0, 0)
ljfeed = feedparser.parse(
         "http://www.linuxjournal.com/news.rss",
         modified=last_retrieval)

If Linux Journal's server is configured well, the above code either results in ljfeed containing the complete syndication feed, returned with an HTTP OK status message and a numeric code of 200, or an indication that the feed has not changed since its last retrieval, with a numeric code of 304. Although keeping track of the last time you requested a particular syndication feed might require more record-keeping on your part, it is important to do, especially if you request feed updates on a regular basis. Otherwise, you might find your application unwelcome at certain sites.
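
A minimal sketch of that record-keeping might look like the following, assuming the feed is fetched over HTTP so that feedparser sets a status attribute on the result; a real aggregator would save the timestamp to disk between runs rather than hard-code it:

import time
import feedparser

feed_url = "http://www.linuxjournal.com/news.rss"

# The time of our last successful retrieval; a real aggregator
# would load this from a file or database rather than hard-code it.
last_retrieval = (2004, 9, 1, 0, 0, 0, 0, 0, 0)

ljfeed = feedparser.parse(feed_url, modified=last_retrieval)

if getattr(ljfeed, "status", None) == 304:
    print("Feed unchanged since last retrieval; nothing new to show.")
else:
    print("Retrieved %d entries." % len(ljfeed.entries))
    # Remember when we asked, so the next run sends an up-to-date
    # If-Modified-Since header.
    last_retrieval = time.gmtime()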

Working with Feeds

Now that we have a basic idea of how to work with feedparser, let's create a simple aggregation tool. This tool gets its input from a file called feeds.txt and produces its output in the form of an HTML file called myfeeds.html. Running this program from cron and looking at the resulting HTML file once per day provides a crude-but-working news feed from the sites that most interest you.

Feeds.txt contains URLs of actual feeds rather than of the sites from which we would like to get the feed. In other words, it's up to the user to find and enter the URL for each feed. More sophisticated aggregation tools usually are able to determine the feed's URL from a <link> tag in the head of the site's home page.
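
For the curious, such autodiscovery amounts to fetching the home page and scanning its head for <link rel="alternate"> tags that point to RSS or Atom feeds. The sketch below does this with the standard library; the names FeedLinkFinder and find_feed_urls are our own and are not part of feedparser:

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

FEED_TYPES = ("application/rss+xml", "application/atom+xml")

class FeedLinkFinder(HTMLParser):
    """Collect feed URLs advertised in <link rel="alternate"> tags."""
    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == "link" and attrs.get("rel") == "alternate"
                and attrs.get("type") in FEED_TYPES):
            # Resolve relative hrefs against the page's own URL.
            self.feeds.append(urljoin(self.base_url, attrs.get("href") or ""))

def find_feed_urls(page_url):
    html = urlopen(page_url).read().decode("utf-8", "replace")
    finder = FeedLinkFinder(page_url)
    finder.feed(html)
    return finder.feeds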

Also, despite my above warning that every news aggregator should keep track of its most recent request so as not to overwhelm servers, this program leaves out such features as part of my attempt to keep it small and readable.

The program, aggregator.py, appears in Listing 1 and is divided into four parts (a rough sketch follows the list):

  1. We first open the output file, which is an HTML-formatted text file called myfeeds.html. The file is designed to be used from within a Web browser. If you are so inclined, you could add this local file, which has a file:/// URL, to your list of personal bookmarks or even make it your startup page. After making sure that we indeed can write to this file, we start the HTML file.

  2. We then read the contents of feeds.txt, which contains one feed URL per line. In order to avoid problems with whitespace or blank lines, we strip off the whitespace and ignore any line without at least one printable character.

  3. Next, we iterate over the list of feeds, feeds_list, invoking feedparser.parse() on each URL. When we receive a response, we write both the URL and the title of each article to the output file, myfeeds.html.

  4. Finally, we close the HTML and the file.
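
Listing 1 is not reproduced here, but a minimal sketch along the lines of those four steps might look like this; the real aggregator.py may differ in its details:

import feedparser

# (1) Open the output file and start the HTML document.
aggregation_file = open("myfeeds.html", "w")
aggregation_file.write("<html><head><title>My feeds</title></head><body>\n")

# (2) Read feeds.txt, one feed URL per line, ignoring blank lines.
feeds_list = []
for line in open("feeds.txt"):
    line = line.strip()
    if line:
        feeds_list.append(line)

# (3) Retrieve each feed and write its entries' titles and links.
for feed_url in feeds_list:
    feed = feedparser.parse(feed_url)
    aggregation_file.write("<h2>%s</h2>\n<ul>\n"
                           % feed.feed.get("title", feed_url))
    for entry in feed.entries:
        aggregation_file.write('<li><a href="%s">%s</a></li>\n'
                               % (entry.get("link", "#"),
                                  entry.get("title", "(untitled)")))
    aggregation_file.write("</ul>\n")

# (4) Close the HTML document and the output file.
aggregation_file.write("</body></html>\n")
aggregation_file.close()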

______________________

Comments


a small semantic error in aggregator.py

zied writes:

Hi,
In aggregator.py, instead of the feed's title, the code writes the title of the first entry:
aggregation_file.write('%s\n' % \
feed.entries[0].title)

I would suggest this instead:
aggregation_file.write('%s\n' % \
feed.channel.title)

bye

Share what you learn what you don't

Install error

midijery writes:

I came up with an error also. I'm running SUSE 9.1, and on installing as per instructions came up with an error:
No module named distutils.core
I've been trying to work with Linux for many years and it's getting much more user friendly, but coming up with errors like this only leads to frustration.

Not so simple install

maskedfrog writes:

I can't speak for other distros, but on Mandrake 10.1, and likely previous versions, libpython2.x-devel must be installed, not just python.

Installing feedparser is extremely simple. Download the latest version, move into its distribution directory and type
python setup.py install.
This activates Python's standard installation utility, placing feedparser in your Python site-packages directory. Once you have installed feedparser, you can test it using Python interactively, from a shell window:

This will quickly result in feedback of:

error: invalid Python installation: unable to open
/usr/lib/python2.3/config/Makefile (No such file or directory)

or something similar, unless libpythonX.x-devel is installed.
Apparently this applies to Red Hat's Fedora also.

Other than that (I haven't checked the code sample from the first reply), this is a fine article that I hope will get me started on my own personal aggregator, so I can replace KNewsTicker with a robust and site-friendly aggregator, and not get banned at /. again (-:

Download link, and example code typo

nathanst writes:

The article doesn't seem to actually say where feedparser can be downloaded from (and there is no "resources" link for this article). Presumably this is the site in question:
http://www.feedparser.org/

Also, in the How New Is that News? section, it looks like the code snippet is actually missing the "modified" parameter in the function call. I think those lines should be:


last_retrieval = (2004, 9, 1, 0, 0, 0, 0, 0, 0)
ljfeed = feedparser.parse("http://www.linuxjournal.com/news.rss",
              modified=last_retrieval )
