Working with LWP

This month Mr. Lerner takes a look at LWP, the library for web programming, and its associated modules.
Extracting Tags

Once we have retrieved the content from a web site, what can we do with it? As demonstrated above, we can print it out or play with the text. But many times, we want to analyze the tags in the document, picking out the images, the hyperlinks or even the headlines.

In order to do this, we could use regular expressions and m//, Perl's matching operator. But an easier way is to use HTML::LinkExtor, a module designed for exactly this purpose. Once we create an instance of HTML::LinkExtor, we can use its parse method to work through each of the tags in the document.

HTML::LinkExtor works differently from many modules you might have used before, in that it uses a “callback”. In this case, a callback is a subroutine defined to take two arguments: a scalar containing the name of the tag and a hash containing the name/value pairs associated with that tag. The subroutine is invoked each time HTML::LinkExtor finds a tag.

For example, given the HTML

<input type="text" value="Reuven"
     name="first_name" size="5">

our callback would have to be prepared to handle a scalar whose value is "input", along with a hash that looks like

(type => "text", value => "Reuven",
     name => "first_name", size => "5")
Listing 2
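
A minimal sketch of such a callback, which simply prints each tag name followed by its name/value pairs, might look like this:

sub callback {
    my ($tag, %attributes) = @_;   # tag name, then name/value pairs

    print "$tag\n";
    foreach my $name (keys %attributes) {
        print "    $name = \"$attributes{$name}\"\n";
    }
}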

If we are interested in printing the various HTML tags to the screen, we could write a simple callback that looks like Listing 2. How do we tell HTML::LinkExtor to invoke our callback subroutine each time it finds a match? The easiest way would be for us to hand callback to the parse method as an argument.

Perl allows us to pass subroutines and other blocks of code as if they were data by creating references to them. A reference looks and acts like a scalar, except that it can be turned into something else. Perl has scalar, array and hash references; subroutine references fit naturally into this picture as well. HTML::LinkExtor will dereference and use our subroutine as we have defined it.
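
As a quick sketch, a subroutine reference can be stored in a scalar and invoked through it:

my $ref = \&callback;          # create a reference to the subroutine
&$ref("a", href => "/");       # dereference and call it
$ref->("a", href => "/");      # the same call, using arrow syntax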

We turn a subroutine into a subroutine reference by prefacing its name with \&. Perl 5 no longer requires that you put & before subroutine names, but it is required when you are passing a subroutine reference. The backslash tells Perl we want to turn the object in question into a reference. If &callback is defined as above, then we can print out all of the links in a document with the following:

my $parser = HTML::LinkExtor->new(\&callback);
$parser->parse($response->content);
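
For context, here is a minimal self-contained sketch that ties these pieces together; the URL is a placeholder, and the callback simply dumps each tag:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;

# Placeholder URL for illustration.
my $url = "http://www.lerner.co.il/";

my $ua       = LWP::UserAgent->new;
my $response = $ua->request(HTTP::Request->new(GET => $url));

if ($response->is_success) {
    my $parser = HTML::LinkExtor->new(\&callback);
    $parser->parse($response->content);
}

sub callback {
    my ($tag, %attributes) = @_;
    print "$tag: ", join(", ", %attributes), "\n";
}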

Note that the content returned with the HTTP response contains all of the document's links. However, that document undoubtedly contained some relative URLs, which will not be interpreted correctly out of context. How can we accurately resolve each link?

HTML::LinkExtor takes that into account, and allows us to pass two arguments to its constructor (new), rather than just one. The second argument, which is optional, is the URL from which we received this content. Passing this URL ensures all URLs we extract will be complete. We must include the line

use URI::URL;

in our application if we want to use this feature. We can then say

my $parser = HTML::LinkExtor->new(\&callback,
                       "http://www.lerner.co.il/");
$parser->parse($response->content);

and our callback will be invoked for each tag, with a full, absolute URL even if the document contains a relative one.

Our version of &callback above prints out all links, not just hyperlinks. By modifying &callback slightly, as shown in Listing 3, we can ignore everything but “anchor” tags, the tags that create hyperlinks.

Listing 3
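
A sketch of the modified callback, which returns immediately unless it was invoked for an anchor tag, might look like this:

sub callback {
    my ($tag, %attributes) = @_;

    # Ignore everything except anchor tags.
    return unless $tag eq "a";

    print "$attributes{href}\n" if exists $attributes{href};
}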

A Small Demonstration

With all of this under our belts, we will write an application (Listing 4) that follows links recursively until the program is stopped. This sort of program can be useful for checking links on your site or harvesting information from documents.

Listing 4
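
Reconstructed from the description below, a sketch of download-recursively.pl might look like this; the agent name and e-mail address are placeholders:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;
use HTTP::Request;
use HTML::LinkExtor;
use URI::URL;

# Starting point; replace with your own URL.
my $origin = "http://www.lerner.co.il/";

my %already_retrieved = ();
my %to_be_retrieved   = ($origin => 1);

# Create the robot user agent once, outside the loop.
my $ua = LWP::RobotUA->new("download-recursively/1.0",
                           'someone@example.com');

while (%to_be_retrieved) {
    # Grab whichever URL keys returns first.
    my ($url) = keys %to_be_retrieved;
    delete $to_be_retrieved{$url};
    $already_retrieved{$url} = 1;

    my $request  = HTTP::Request->new(GET => $url);
    my $response = $ua->request($request);
    next unless $response->is_success;

    # Pass $url as the base so relative links come back absolute.
    my $parser = HTML::LinkExtor->new(\&callback, $url);
    $parser->parse($response->content);
}

sub callback {
    my ($tag, %attributes) = @_;
    return unless $tag eq "a";      # hyperlinks only
    my $link = $attributes{href};
    return unless defined $link;
    $to_be_retrieved{"$link"} = 1
        unless exists $already_retrieved{"$link"};
}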

Our program, download-recursively.pl, starts at the URL called $origin and collects the URLs contained within it, placing them in the hash %to_be_retrieved. It then goes through each of those URLs one by one, collecting any hyperlinks that might be contained within them. Each time it retrieves a URL, download-recursively.pl places it in %already_retrieved. This ensures we will not download the same URL twice.

We create $ua, our instance of LWP::RobotUA, outside of the “while” loop. After all, our HTTP requests and responses will be changing with each loop iteration, but our user agent can remain the same throughout the life of the program.

We go through each URL in %to_be_retrieved in a seemingly random order, taking the first item returned by keys. It is obviously possible to sort the keys before taking the first element from the resulting list or to do a depth-first or breadth-first search through the list of URLs.
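
For instance (a sketch using the hash from the program above):

# Take whichever URL comes back first from keys:
my ($url) = keys %to_be_retrieved;

# Or impose a predictable order by sorting the keys first:
my ($first_alphabetically) = sort keys %to_be_retrieved;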

Inside the loop, the code is as we might expect: we create a new instance of HTTP::Request and pass it to $ua, receiving a new instance of HTTP::Response in return. Then we parse the response content with HTML::LinkExtor, putting each new URL in %to_be_retrieved, but only on the condition that it is not already a key in %already_retrieved.

You may find it interesting to let this program go for a while, following links from one of your favorite sites. The Web is all about linking; see who is linking to whom. You might be surprised by what you find.
