At the Forge - Aggregating Syndication Feeds

So far, we have looked at ways in which people might create RSS and Atom feeds for a Web site. Of course, creating syndication feeds is only half of the equation. Equally important, and perhaps even more useful, is understanding how we can retrieve and use syndication feeds, both from our own sites and from other sites of interest.

As you can see from looking at the code listing, creating such a news aggregator for personal use is fairly simple and straightforward. This is merely a skeletal application, however. To be more useful in the real world, we probably would want to move feeds.txt and myfeeds.html into a relational database, determine the feed URL automatically or semi-automatically based on a site URL and handle categories of feeds, so that multiple feeds can be read as if they were one.

If the above description sounds familiar, then you might be a user of Bloglines.com, a Web-based blog aggregator that probably works in the above way. Obviously, Bloglines handles many more feeds and many more users than we had in this simple toy example. But, if you are interested in creating an internal version of Bloglines for your organization, the combination of the Universal Feed Parser with a relational database, such as PostgreSQL, and some personalization code is both easy to implement and quite useful.
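To sketch the database side of such a setup (the article suggests PostgreSQL; the standard library's sqlite3 serves as a stand-in here, and the table, column and category names are purely illustrative), feeds and their categories might be stored like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
conn.executescript("""
    CREATE TABLE feeds (
        id       INTEGER PRIMARY KEY,
        url      TEXT UNIQUE NOT NULL,
        category TEXT            -- lets several feeds be read as if one
    );
""")

conn.executemany("INSERT INTO feeds (url, category) VALUES (?, ?)", [
    ("http://www.linuxjournal.com/news.rss", "linux"),
    ("http://example.com/feed.rss", "misc"),  # hypothetical feed URL
])

def feeds_in_category(category):
    """Return the URLs of all feeds in a category, ready to be parsed."""
    rows = conn.execute(
        "SELECT url FROM feeds WHERE category = ? ORDER BY url", (category,))
    return [url for (url,) in rows]
```

Per-user personalization would then be a matter of adding a users table and a join table of subscriptions on top of this.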

Conclusion

The tendency to reinvent the wheel often is cited as a widespread problem in the computer industry. Mark Pilgrim's Universal Feed Parser might fill only a small need in the world of software, but that need is almost certain to grow as the use of syndication increases for individuals and organizations alike. The bottom line is that if you are interested in reading and parsing syndication feeds, you should be using feedparser. It is heavily tested and documented, often updated and improved, and it does its job quickly and well.

Reuven M. Lerner, a longtime Web/database consultant and developer, now is a graduate student in the Learning Sciences program at Northwestern University. His Weblog is at altneuland.lerner.co.il, and you can reach him at reuven@lerner.co.il.

______________________

Comments


a small semantic error in aggregator.py

zied

Hi,
In aggregator.py, the title written out is the first entry's title, not the feed's title:

aggregation_file.write('%s\n' % \
    feed.entries[0].title)

I would suggest this instead:

aggregation_file.write('%s\n' % \
    feed.channel.title)

bye

Share what you learn, learn what you don't

Install error

midijery

I came up with an error also. I'm running SUSE 9.1, and on installing as per the instructions I got an error:
No module named distutils.core
I've been trying to work with Linux for many years, and it's getting much more user friendly, but errors like this only lead to frustration.

Not so simple install

maskedfrog

I can't speak for other distros, but on Mandrake 10.1 (and likely previous versions) libpython2.x-devel must be installed, not just python.

The article says:

Installing feedparser is extremely simple. Download the latest version, move into its distribution directory and type

python setup.py install

This activates Python's standard installation utility, placing feedparser in your Python site-packages directory. Once you have installed feedparser, you can test it using Python interactively, from a shell window.

But the install step will quickly result in:

error: invalid Python installation: unable to open
/usr/lib/python2.3/config/Makefile (No such file or directory)

or similar, unless libpythonX.x-devel is installed.
Apparently this applies to Red Hat Fedora also.

Other than that (I haven't checked the code sample from the first reply), this is a fine article that I hope will get me started on my own personal aggregator, so I can replace KNewsTicker with a robust and site-friendly aggregator, and not get banned at /. again (-:

Download link, and example code typo

nathanst

The article doesn't seem to actually say where feedparser can be downloaded from (and there is no "resources" link for this article). Presumably this is the site in question:
http://www.feedparser.org/

Also, in the "How New Is that News?" section, it looks like the code snippet is actually missing the "modified" parameter in the function call. I think those lines should be:


last_retrieval = (2004, 9, 1, 0, 0, 0, 0, 0, 0)
ljfeed = feedparser.parse("http://www.linuxjournal.com/news.rss",
                          modified=last_retrieval)
