At the Forge - Aggregating with Atom
In the world of organized crime, a syndicate is a collection of gangsters who work together. In the world of newspapers, a syndicate distributes information to subscribers, allowing each publication to tailor the content of information it receives. Comics, news stories and opinion columns often are distributed by syndicates, providing greater exposure for the authors and more content for the readers.
In the past few years, Web developers also have begun to use the term syndicate, as both a verb and a noun. Fortunately for our safety, syndication on the Web has more in common with newspapers than with the mob. But as with organized crime, many people have been hurt in public disputes (albeit with words, not guns), leading to a split and a fair amount of acrimony in the world of Web syndication.
The result of this split is Atom, a new syndication format that has much in common with RSS (rich site summary or RDF site summary, depending on the version and whom you ask). I believe that Atom offers a number of advantages over any version of RSS, and that the simplicity with which Atom feeds can be created makes it an obvious choice over RSS. That said, the fact that most Weblog products provide RSS feeds means that the two camps happily can coexist for now. Understanding how both work also means your organization can decide to adopt one or both standards, depending on your needs.
As we saw last month, RSS really is two different formats, or more precisely, two different families of formats. RSS 0.9x and RSS 2.0 are from the same family and demonstrate the evolution, over time, of syndication on the Web. RSS 2.0 is maintained mainly by Dave Winer of Userland, scripting.com and (most recently) Harvard University. Winer has given ownership of the standard to Harvard but also has declared that version 2.0 will be the final one. Nevertheless, the combination of RSS 0.9x and RSS 2.0 represents a widespread, stable, well-understood and ambiguous protocol for syndicating Web content.
A separate flavor of RSS, confusingly known as RSS 1.0, uses the Resource Description Framework (RDF) produced by the World Wide Web Consortium (W3C). RDF is designed to make it possible for computers to understand a site's contents, allowing them to make connections between sites, much as people instinctively do all the time. RSS 1.0 produces a summary that is incompatible with all other versions of RSS, using RDF to produce a standardized description of the site's contents.
The fact that RSS 1.0 used the RSS name caused a great deal of friction and animosity, with many people variously blaming Dave Winer, the vagueness of the RSS specification and the proponents of Atom's predecessor. At the end of the day, a number of prominent individuals, led by Tim Bray, Mark Pilgrim and Sam Ruby, were backed by such companies as Six Apart (which publishes Movable Type software for Weblogs) to produce a specification, initially called PIE and Echo, which attempts to address the shortcomings of RSS.
The development of Atom took some time, because it involved understanding and defining exactly what syndication means on today's World Wide Web. RSS no longer is used only for news sites, its original target, but also for Weblogs and nontextual content. The developers decided to make internationalization a top priority, meaning that it should be possible to produce a syndication feed in any language. Another priority was the development of extensions—that is, it should be possible to add new functionality to the Atom feed without having to redefine the core Atom specification.
As of this writing (mid-August 2004), the Atom specification now exists in version 0.3, along with a standard API for editing content over the network. Atom has begun the process of becoming standardized by the IETF (the Internet Engineering Task Force, which produces and publishes Internet standards), meaning it is on its way to being a universally accepted standard, much like TCP/IP, SMTP or HTTP. This undoubtedly will lead to even greater interest in Atom from organizations that wait for the IETF's stamp of approval.
Atom is still in its initial stages, lacking public specifications for a number of items, such as its extension mechanism. But its authors have, to date, produced a standard whose complexity is fairly close to that of RSS 0.9x and 2.0 and which is written in as unambiguous a fashion as possible. Its development has included many members of the Web syndication community, and it offers a vision of syndication that goes far beyond the Web.
Although RSS was designed to summarize a news feed or Weblog, Atom was created with a more general purpose in mind. For example, factory machines could produce status reports in Atom, with an aggregator displaying those that are malfunctioning. Libraries could produce Atom feeds of the latest additions to their collections, with smart aggregators looking for books on certain subjects. Fax machines could be replaced by fax modems, using Atom to distribute fax images to appropriate groups of people.
You even could use Atom feeds to create a newspaper publishing system, where reporters send their stories not as e-mail, but instead publish drafts on an Atom feed. Each editor would aggregate Atom feeds from the reporters under his or her control, moving them onto an outgoing Atom feed when the editing was complete. The final feed would end up in the production department, where the text would be laid out and made ready for actual printing. The newspaper's content flow thus would be a flow of many Atom feeds into a single, final feed representing the newspaper itself.
Producing an Atom feed is fairly simple, if you use Perl or another high-level language for which an Atom library exists. Perl, for example, has the XML::Atom module, available from CPAN (Comprehensive Perl Archive Network). I had a bit of trouble installing XML::Atom on my machine running Fedora Core 2 and Perl 5.8.3, but I was able to work around it by ignoring the optional DateTime module during the installation process. I would not recommend doing so in a production environment.
Although XML::Atom is the overall package name, programs that create Atom feeds actually use XML::Atom::Feed and XML::Atom::Entry. Here is a short Perl program that produces a simple feed, based in part on the sample program in the perldoc on-line documentation for XML::Atom::Feed:
    #!/usr/bin/perl

    use strict;
    use diagnostics;
    use warnings;

    use XML::Atom::Feed;
    use XML::Atom::Entry;

    # Create a new Atom feed
    my $feed = XML::Atom::Feed->new;
    $feed->title('My Weblog');

    my $entry;

    # Create a first entry for the feed
    $entry = XML::Atom::Entry->new;
    $entry->title('First Post');
    $entry->content('First Post Body');
    $feed->add_entry($entry);

    # Create a second entry for the feed
    $entry = XML::Atom::Entry->new;
    $entry->title('Second Post');
    $entry->content('Second Post Body');
    $feed->add_entry($entry);

    # Now produce the XML output
    my $atom_feed_xml = $feed->as_xml;

    # Display the XML output
    print $atom_feed_xml, "\n";
The above program produces the following feed, which I have formatted with extra whitespace for easier reading:
    <?xml version="1.0"?>
    <feed xmlns="http://purl.org/atom/ns#">
      <title>My Weblog</title>
      <entry>
        <title>First Post</title>
        <content mode="xml">
          <default:div xmlns="http://www.w3.org/1999/xhtml">First Post Body</default:div>
        </content>
      </entry>
      <entry>
        <title>Second Post</title>
        <content mode="xml">
          <default:div xmlns="http://www.w3.org/1999/xhtml">Second Post Body</default:div>
        </content>
      </entry>
    </feed>
As you can see, we create a single XML::Atom::Feed object, containing one or more instances of XML::Atom::Entry. Each entry object corresponds to a single <entry> tag in the Atom feed, which in turn represents a single entry in our Weblog or a single message from our factory floor.
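The same correspondence holds in the other direction, which is what an aggregator relies on. The following is only a minimal sketch, assuming that XML::Atom is installed and that your version accepts a reference to an XML string through the Stream parameter. The machine names and status messages are invented for illustration, and a real aggregator would retrieve the feed over the network rather than build it in the same program:

```perl
#!/usr/bin/perl

use strict;
use warnings;

use XML::Atom::Feed;
use XML::Atom::Entry;

# Build a small feed in memory so the example is self-contained.
# The machine names and statuses below are hypothetical.
my $feed = XML::Atom::Feed->new;
$feed->title('Factory Floor');

foreach my $status ('Press 1: OK', 'Lathe 2: malfunction', 'Press 3: OK') {
    my $entry = XML::Atom::Entry->new;
    $entry->title($status);
    $entry->content("Status report: $status");
    $feed->add_entry($entry);
}

my $xml = $feed->as_xml;

# An aggregator normally would fetch this XML from a URL; here we
# simply parse the string produced above.
my $parsed = XML::Atom::Feed->new(Stream => \$xml);

# Each <entry> element comes back as an XML::Atom::Entry object,
# so we can filter on any of its fields.
my @problems = grep { $_->title =~ /malfunction/i } $parsed->entries;

print $_->title, "\n" foreach @problems;
```

Filtering on the title keeps the sketch short. An aggregator with more ambition might look at the content or at other elements of each entry instead.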
The Atom specification indicates that the feed may contain a number of attributes and sub-elements, including a language, a description of the Weblog or site, copyright information and other general information about the originating site. Each entry, in turn, has its own set of elements, such as a title, an indication of when it was created and a summary. Each Atom element also has a MIME type indicating what type of content it contains, much like HTTP responses and e-mail attachments.
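To make this more concrete, here is a hedged sketch that sets a few of these items with XML::Atom. It assumes that your installed version provides the language method on feeds, as well as the XML::Atom::Person and XML::Atom::Link classes; accessor names have varied between releases, so check the perldoc for your copy. The author name, e-mail address and URL are placeholders:

```perl
#!/usr/bin/perl

use strict;
use warnings;

use XML::Atom::Feed;
use XML::Atom::Entry;
use XML::Atom::Person;
use XML::Atom::Link;

my $feed = XML::Atom::Feed->new;
$feed->title('My Weblog');

# Feed-level language, stored as an xml:lang attribute
$feed->language('en');

# A link back to the originating site; its type is a MIME type.
# The URL is a placeholder.
my $link = XML::Atom::Link->new;
$link->type('text/html');
$link->rel('alternate');
$link->href('http://www.example.com/weblog/');
$feed->add_link($link);

# Entry-level author information (placeholder values)
my $author = XML::Atom::Person->new;
$author->name('A. N. Author');
$author->email('author@example.com');

my $entry = XML::Atom::Entry->new;
$entry->title('First Post');
$entry->author($author);
$entry->content('First Post Body');
$feed->add_entry($entry);

my $xml = $feed->as_xml;
print $xml, "\n";
```

The exact XML that comes out depends on the version of the module and on the Atom version it targets, so treat the output as something to inspect rather than something to rely on.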
Of course, creating a feed, as in the above example, is necessary only if you are writing a new Atom-powered application or if you are adding Atom capabilities to a Weblog product. Most Weblog products now provide Atom feeds, either as part of their standard distribution or through a plugin or other extension mechanism. For example, an Atom feed plugin for the Blosxom Weblog product makes it easy to add such a feed to a Weblog; install the plugin (by placing it in the plugins directory), and anyone interested in receiving an Atom feed from the Weblog in question will be able to do so.
It shouldn't come as a surprise that this is so easy to accomplish, given the fact that Blosxom is written in Perl, that Perl provides excellent tools for working with XML and that the plugin simply needs to summarize and rewrite content from the most recent entries in the Weblog. Because Blosxom makes it so easy for plugins to modify the main page (so as to advertise the Atom feed) and to retrieve content (through the plugin API), it might be slightly easier to work with Atom from that product. Given that most Weblog products are written in a high-level language, such as Perl, Python or PHP, it should be easy to add an Atom feed where none currently exists.