At the Forge - Aggregating Syndication Feeds
The point of a news aggregator or other application that uses RSS and Atom is to collect and present newly updated information. An aggregator can show only the items that a server provides; if an RSS feed includes only the two most recently published items, then it becomes the aggregator's responsibility to poll, cache and display those items no longer being syndicated and summarized.
This raises two different but related questions: How can we ensure that our aggregator displays only items we have not seen yet? And is there a way for our aggregator to reduce the load on Weblog servers, retrieving only those items that were published since our last visit? Answering the first question requires looking at the modification date, if it exists, for each item.
The latter question has, as of this writing, been an increasingly popular issue of debate in the Web community. As a Weblog grows in popularity, the number of people who subscribe to its syndication feed also grows. If a Weblog has 500 subscribers to its syndication feed, and if each of these subscribers' aggregators look for updates each hour, that means an additional 500 requests per hour are made against a Web server. If the syndication feed provides the site's entire content, this can result in a great deal of wasted bandwidth—reducing the site's response time for other visitors and potentially forcing the site owner to pay for exceeding allocated monthly bandwidth.
feedparser allows us to be kind to syndicating servers and ourselves by providing a mechanism for retrieving a syndication feed only when there is something new to show. This is possible because modern versions of HTTP allow the requesting client to include an If-Modified-Since header, followed by a date. If the requested URL has changed since the date mentioned in the request, the server responds with the URL's content. But if the requested URL is unchanged, the server returns a 304 response code, indicating that the previously downloaded version remains the most current content.
We accomplish this by passing an optional modified parameter to our call to feedparser.parse(). This parameter is a standard, as defined by the time module, Python tuple in which the first six elements are the year, month number, day number, hour, minutes and seconds. The final three items don't concern us, and can be left as zeroes. So if I were interested in seeing feeds posted since September 1, 2004, I could say:
last_retrieval = (2004, 9, 1, 0, 0, 0, 0, 0, 0) ljfeed = feedparser.parse( "http://www.linuxjournal.com/news.rss")
If Linux Journal's server is configured well, the above code either results in ljfeed containing the complete syndication feed—returned with an HTTP OK status message, with a numeric code of 200--or an indication that the feed has not changed since its last retrieval, with a numeric code of 304. Although keeping track of the last time you requested a particular syndication feed might require more record-keeping on your part, it is important to do, especially if you requestfeed updates on a regular basis. Otherwise, you might find your application unwelcome at certain sites.
Now that we have a basic idea of how to work with feedparser, let's create a simple aggregation tool. This tool gets its input from a file called feeds.txt and produces its output in the form of an HTML file called feeds.html. Running this program by cron and looking at the resulting HTML file once per day provides a crude-but-working news feed from the sites that most interest you.
Feeds.txt contains URLs of actual feeds rather than of the sites from which we would like to get the feed. In other words, it's up to the user to find and enter the URL for each feed. More sophisticated aggregation tools usually are able to determine the feed's URL from a link tag in the header of the site's home page.
Also, despite my above warning that every news aggregator should keep track of its most recent request so as not to overwhelm servers, this program leaves out such features as part of my attempt to keep it small and readable.
The program, aggregator.py, can be read in Listing 1 and is divided into four parts:
We first open the output file, which is an HTML-formatted text file called myfeeds.html. The file is designed to be used from within a Web browser. If you are so inclined, you could add this local file, which has a file:/// URL, to your list of personal bookmarks or even make it your startup page. After making sure that we indeed can write to this file, we start the HTML file.
We then read the contents of feeds.txt, which contains one feed URL per line. In order to avoid problems with whitespace or blank lines, we strip off the whitespace and ignore any line without at least one printable character.
Next, we iterate over the list of feeds, feeds_list, invoking feedparser.parse() on that URL. When we receive a response, we write it to the output file, myfeeds.html, with both the URL and the title of the article.
Finally, we close the HTML and the file.
Practical Task Scheduling Deployment
July 20, 2016 12:00 pm CDT
One of the best things about the UNIX environment (aside from being stable and efficient) is the vast array of software tools available to help you do your job. Traditionally, a UNIX tool does only one thing, but does that one thing very well. For example, grep is very easy to use and can search vast amounts of data quickly. The find tool can find a particular file or files based on all kinds of criteria. It's pretty easy to string these tools together to build even more powerful tools, such as a tool that finds all of the .log files in the /home directory and searches each one for a particular entry. This erector-set mentality allows UNIX system administrators to seem to always have the right tool for the job.
Cron traditionally has been considered another such a tool for job scheduling, but is it enough? This webinar considers that very question. The first part builds on a previous Geek Guide, Beyond Cron, and briefly describes how to know when it might be time to consider upgrading your job scheduling infrastructure. The second part presents an actual planning and implementation framework.
Join Linux Journal's Mike Diehl and Pat Cameron of Help Systems.
Free to Linux Journal readers.Register Now!
- SUSE LLC's SUSE Manager
- My +1 Sword of Productivity
- Murat Yener and Onur Dundar's Expert Android Studio (Wrox)
- Non-Linux FOSS: Caffeine!
- Managing Linux Using Puppet
- Doing for User Space What We Did for Kernel Space
- Tech Tip: Really Simple HTTP Server with Python
- SuperTuxKart 0.9.2 Released
- Parsing an RSS News Feed with a Bash Script
- Rogue Wave Software's Zend Server
With all the industry talk about the benefits of Linux on Power and all the performance advantages offered by its open architecture, you may be considering a move in that direction. If you are thinking about analytics, big data and cloud computing, you would be right to evaluate Power. The idea of using commodity x86 hardware and replacing it every three years is an outdated cost model. It doesn’t consider the total cost of ownership, and it doesn’t consider the advantage of real processing power, high-availability and multithreading like a demon.
This ebook takes a look at some of the practical applications of the Linux on Power platform and ways you might bring all the performance power of this open architecture to bear for your organization. There are no smoke and mirrors here—just hard, cold, empirical evidence provided by independent sources. I also consider some innovative ways Linux on Power will be used in the future.Get the Guide