A Web-Based Clipping Service

Let LWP turn your web client into a midnight marauder.

In November, we saw how Perl's Library for Web Programming (LWP) can be used to create a simple HTTP client, retrieving one or more pages from the Web. This month, we will extend those efforts to create a program that can not only retrieve pages from the Web, but categorize them according to our preferences. In this way, we can create our own web-based clipping service, finding those articles that are of particular interest to us.

LWP consists of several modules which allow us to work with HTTP, the “hypertext transfer protocol”. HTTP works on a stateless request-response basis: a client connects to a server and submits a request. The server then generates a response, and closes the connection. (If you missed last month's column, it is available here: Working with LWP. You should read that article before continuing.)

Downloading Files

We need a program that will go to a particular URL and save the contents of that URL on disk. Furthermore, we want to follow any hyperlinks in that document to collect other news stories. However, we do not want to follow links to other sites; this not only reduces the chances that we will get sidetracked, but avoids the possibility of being led astray too much.

In other words, I would like to be able to point a program at a site and retrieve all of its files on to the disk. A first stab at such a program, download-recursively.pl, is similar to the simple robot program we explored last month. It uses two hashes, %already_retrieved and %to_be_retrieved, to store URLs. Rather than storing the URLs as values in the hash, we use them as keys. This ensures each URL will appear only once, avoiding infinite loops and miscounting. URLs are placed in %to_be_retrieved when they are first encountered, then moved to %already_retrieved after their contents are retrieved. $origin, a scalar variable that contains the initial URL, has a default setting if no argument is provided on the command line.

Retrievals are performed with a while loop. Each iteration of the while loop retrieves another URL from %to_be_retrieved, and uses it to create a new instance of HTTP::Request.

The method $response->last_modified returns the date and time on which a document was last modified. Subtracting $response->last_modified from the current time, and then comparing this result with the maximum age of documents we wish to see ($maximum_age) allows us to filter out relatively old documents:

my $document_age = time -
    if ($document_age > $maximum_age)
        print STDOUT
        "  Age of document: $document_age\n";

If the document is too old, we use next to return us to the next iteration of the while loop—and thus the next URL to be retrieved.

Next, we parse the contents of the HTTP response, using the HTML::LinkExtor module. When we create an instance of HTML::LinkExtor, we are actually creating a simple parser that can look through a page of HTML and return one or more pieces of information. The analysis is performed by a subroutine, often named callback. A reference to callback is passed along with the URL that will be parsed to create a new instance of HTML::LinkExtor.

my $parser = HTML::LinkExtor->new (\&callback, $url);

The resulting object can then parse our URL's contents by invoking:

When $parser->parse is invoked, &callback is invoked once for each HTML tag in the file. Our version of &callback grabs each URL in the file from the href attribute of each <a> tag, saving it in %to_be_retrieved unless it exists in %already_retrieved.

Finally, we save the retrieved document on the local file system. The tricky part of saving the file to disk has to do with the way in which we are retrieving the URLs—we are not traversing a tree of URLs, but are pulling URLs out in their hash order. This means the URL http://foo.com/a/b/c/ might be retrieved before http://foo.com/a/index.html. Thus, we need to ensure that the directory /a/b/c exists on our local system before /a and /a/b are created. How can we do this?

My solution was to use Perl's built-in split operator, which turns a scalar into a list. By assigning this list of partial directories into an array (@output_directory), we can sequentially build up the directory from the root (/) down to the final name. Along the way, we check to see if the directory exists. If it does not, we create the new directory with mkdir. If the directory does not exist and mkdir fails, we exit with a fatal error, indicating what error occurred.

Those URLs that lack a file name are given one of “index.html”. Since this is the default file name accessed on many web servers, it stands to reason this will probably not collide with any of those names.

The end result of running this program is a mirror of the downloaded site, stored in $output_directory.