Working with LWP

This month Mr. Lerner takes a look at LWP, the library for web programming in Perl, and its associated modules.
Extracting Tags

Once we have retrieved the content from a web site, what can we do with it? As demonstrated above, we can print it out or play with the text. But many times, we want to analyze the tags in the document, picking out the images, the hyperlinks or even the headlines.

In order to do this, we could use regular expressions and m//, Perl's matching operator. But an easier way is to use HTML::LinkExtor, a module designed for exactly this purpose. Once we create an instance of HTML::LinkExtor, we can use its parse method to retrieve the tags in a document.

HTML::LinkExtor works differently from many modules you might have used before, in that it uses a “callback”. In this case, a callback is a subroutine defined to take two arguments: a scalar containing the name of the tag and a hash containing the name/value pairs associated with that tag. The subroutine is invoked each time HTML::LinkExtor finds a tag.

For example, given the HTML

<input type="text" value="Reuven"
     name="first_name" size="5">

our callback would have to be prepared to handle a scalar whose value is input, along with a hash that looks like this:

(type => "text", value => "Reuven",
     name => "first_name", size => "5")
Listing 2

If we are interested in printing the various HTML tags to the screen, we could write a simple callback that looks like Listing 2. How do we tell HTML::LinkExtor to invoke our callback subroutine each time it finds a tag? The easiest way is to hand a reference to &callback to HTML::LinkExtor's constructor, new, as shown below.
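The exact code appears in Listing 2, but a rough sketch of such a tag-printing callback might look like the following (the output format here is an assumption):

# A sketch of a tag-printing callback (an approximation of Listing 2)
sub callback
{
    my ($tag, %attributes) = @_;

    print "Tag: $tag\n";
    foreach my $key (sort keys %attributes)
    {
        print "    $key => $attributes{$key}\n";
    }
}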

Perl allows us to pass subroutines and other blocks of code as if they were data by creating references to them. A reference is a scalar value that points to another piece of data and can be dereferenced to recover it. Perl has scalar, array and hash references; subroutine references fit naturally into this picture as well. HTML::LinkExtor will dereference and use our subroutine as we have defined it.
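As a small aside (this snippet is not part of the article's listings), the following shows the basic mechanics of taking a reference to a subroutine and then calling it through that reference:

# Take a reference to a subroutine, then invoke it via the reference
sub greet
{
    my ($name) = @_;
    print "Hello, $name!\n";
}

my $greeting = \&greet;    # a subroutine reference
$greeting->("Reuven");     # dereference and call it: prints "Hello, Reuven!"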

We turn a subroutine into a subroutine reference by prefacing its name with \&. Perl 5 no longer requires that you put & before subroutine names, but it is required when you are passing a subroutine reference. The backslash tells Perl we want to turn the object in question into a reference. If &callback is defined as above, then we can print out all of the links in a document with the following:

my $parser = HTML::LinkExtor->new(\&callback);
$parser->parse($response->content);

Note that $response->content contains the full HTML document returned with the HTTP response. That document undoubtedly contains some relative URLs, which will not be interpreted correctly out of context. How can we turn them into complete, accurate links?

HTML::LinkExtor takes that into account, and allows us to pass two arguments to its constructor (new), rather than just one. The second argument, which is optional, is the URL from which we received this content. Passing this URL ensures all URLs we extract will be complete. We must include the line

use URI::URL;

in our application if we want to use this feature. We can then say

my $parser = HTML::LinkExtor->new(\&callback,
                       "http://www.lerner.co.il/");
$parser->parse($response->content);

and our callback will be invoked for each tag, with a full, absolute URL even if the document contains a relative one.

Our version of &callback above prints out all links, not just hyperlinks. We can ignore all but “anchor” tags (the tags that create hyperlinks) by modifying &callback slightly, as shown in Listing 3.

Listing 3
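The exact code is in Listing 3; a rough sketch of such a modified callback might look like this (the output format is an assumption):

# A sketch of a hyperlink-only callback (an approximation of Listing 3)
sub callback
{
    my ($tag, %attributes) = @_;

    # Ignore everything except anchor tags with an "href" attribute
    return unless ($tag eq "a" and exists $attributes{href});

    print "Link: $attributes{href}\n";
}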

A Small Demonstration

With all of this under our belts, we will write an application (Listing 4) that follows links recursively until the program is stopped. This sort of program can be useful for checking links on your site or harvesting information from documents.

Listing 4

Our program, download-recursively.pl, starts at the URL called $origin and collects the URLs contained within it, placing them in the hash %to_be_retrieved. It then goes through each of those URLs one by one, collecting any hyperlinks that might be contained within them. Each time it retrieves a URL, download-recursively.pl places it in %already_retrieved. This ensures we will not download the same URL twice.

We create $ua, our instance of LWP::RobotUA, outside of the “while” loop. After all, our HTTP requests and responses will be changing with each loop iteration, but our user agent can remain the same throughout the life of the program.

We go through each URL in %to_be_retrieved in a seemingly random order, taking the first item returned by keys. It is obviously possible to sort the keys before taking the first element from the resulting list or to do a depth-first or breadth-first search through the list of URLs.

Inside the loop, the code is as we might expect: we create a new instance of HTTP::Request and pass it to $ua, receiving a new instance of HTTP::Response in return. Then we parse the response content with HTML::LinkExtor, putting each new URL in %to_be_retrieved, but only on the condition that it is not already a key in %already_retrieved.
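The actual code is in Listing 4; the following is only a rough sketch of how download-recursively.pl might be structured, with the agent name, e-mail address and starting URL chosen purely for illustration:

#!/usr/bin/perl
# download-recursively.pl (a sketch approximating Listing 4, not the original)

use strict;
use warnings;

use LWP::RobotUA;
use HTTP::Request;
use HTML::LinkExtor;
use URI::URL;

# Starting URL, chosen for illustration only
my $origin = "http://www.lerner.co.il/";

my %already_retrieved = ();
my %to_be_retrieved   = ($origin => 1);

# Create the user agent once, outside of the loop
my $ua = LWP::RobotUA->new("lj-demo/1.0", 'user@example.com');

while (my ($url) = keys %to_be_retrieved)
{
    delete $to_be_retrieved{$url};
    next if $already_retrieved{$url};

    # Retrieve the document and mark it as retrieved
    my $request  = HTTP::Request->new(GET => $url);
    my $response = $ua->request($request);
    $already_retrieved{$url} = 1;

    next unless $response->is_success;

    print "Retrieved $url\n";

    # Extract hyperlinks, resolving relative URLs against $url
    my $parser = HTML::LinkExtor->new(\&callback, $url);
    $parser->parse($response->content);
}

sub callback
{
    my ($tag, %attributes) = @_;

    # Only anchor tags with an "href" attribute interest us
    return unless ($tag eq "a" and exists $attributes{href});

    my $new_url = $attributes{href};
    $to_be_retrieved{$new_url} = 1
        unless exists $already_retrieved{$new_url};
}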

You may find it interesting to let this program go for a while, following links from one of your favorite sites. The Web is all about linking; see who is linking to whom. You might be surprised by what you find.
