Working with LWP

This month Mr. Lerner takes a look at the library for web programming and its associated modules.
Extracting Tags

Once we have retrieved the content from a web site, what can we do with it? As demonstrated above, we can print it out or play with the text. But many times, we want to analyze the tags in the document, picking out the images, the hyperlinks or even the headlines.

In order to do this, we could use regular expressions and m//, Perl's matching operator. But an easier way is to use HTML::LinkExtor, another object that is designed for this purpose. Once we create an instance of HTML::Extor, we can then use the parse method to retrieve each of the tags in it.

HTML::LinkExtor works differently from many modules you might have used before, in that it uses a “callback”. In this case, a callback is a subroutine defined to take two arguments—a scalar containing the name of the tag and a hash containing the name,value pairs associated with that tag. The subroutine is invoked each time HTML::LinkExtor finds a tag.

For example, given the HTML

<input type="text" value="Reuven"
     name="first_name" size="5">

our callback would have to be prepared to handle a scalar of value input, with a hash that looks like

(type => "text", value => "Reuven",
     name => "first_name", size => "5")
Listing 2

If we are interested in printing the various HTML tags to the screen, we could write a simple callback that looks like Listing 2. How do we tell HTML::LinkExtor to invoke our callback subroutine each time it finds a match? The easiest way would be for us to hand callback to the parse method as an argument.

Perl allows us to pass subroutines and other blocks of code as if they were data by creating references to them. A reference looks and acts like a scalar, except that it can be turned into something else. Perl has scalar, array and hash references; subroutine references fit naturally into this picture as well. HTML::LinkExtor will dereference and use our subroutine as we have defined it.

We turn a subroutine into a subroutine reference by prefacing its name with \&. Perl 5 no longer requires that you put & before subroutine names, but it is required when you are passing a subroutine reference. The backslash tells Perl we want to turn the object in question into a reference. If &callback is defined as above, then we can print out all of the links in a document with the following:

my $parser = HTML::LinkExtor->new(\&callback);

Note that $content might have all HTML links that were returned with the HTTP response. However, that response undoubtedly contained some relative URLs, which will not be interpreted correctly out of context. How can we accurately view the link?

HTML::LinkExtor takes that into account, and allows us to pass two arguments to its constructor (new), rather than just one. The second argument, which is optional, is the URL from which we received this content. Passing this URL ensures all URLs we extract will be complete. We must include the line

use URI::URL;

in our application if we want to use this feature. We can then say

my $parser = HTML::LinkExtor->new(\&callback,
and our callback will be invoked for each tag, with a full, absolute URL even if the document contains a relative one.

Our version of &callback above prints out all links, not just hyperlinks. We can ignore all but “anchor” tags, which allow us to create hyperlinks by modifying &callback slightly, as shown in Listing 3.

Listing 3

A Small Demonstration

With all of this under our belts, we will write an application (Listing 4) that follows links recursively until the program is stopped. This sort of program can be useful for checking links on your site or harvesting information from documents.

Listing 4

Our program,, starts at the URL called $origin and collects the URLs contained within it, placing them in the hash %to_be_retrieved. It then goes through each of those URLs one by one, collecting any hyperlinks that might be contained within them. Each time it retrieves a URL, places it in %already_retrieved. This ensures we will not download the same URL twice.

We create $ua, our instance of LWP::RobotUA, outside of the “while” loop. After all, our HTTP requests and responses will be changing with each loop iteration, but our user agent can remain the same throughout the life of the program.

We go through each URL in %to_be_retrieved in a seemingly random order, taking the first item returned by keys. It is obviously possible to sort the keys before taking the first element from the resulting list or to do a depth-first or breadth-first search through the list of URLs.

Inside the loop, the code is as we might expect: we create a new instance of HTTP:Request and pass it to $ua, receiving a new instance of HTTP:Response in return. Then we parse the response content with HTML::LinkExtor, putting each new URL in %to_be_retrieved, but only on the condition that it is not already a key in %already_retrieved.

You may find it interesting to let this program go for a while, following links from one of your favorite sites. The Web is all about linking; see who is linking to whom. You might be surprised by what you find.