Using Perl to Check Web Links

Do you have many links on your web pages? If so, you're probably finding that the pages at the ends of those links are disappearing faster than you can track them. Ah, but now you can get the computer to handle that for you.

One of the first things I did when I got my first Internet account was put together my own set of web pages. The one I get the most comments about is called “Weirichs on the Web”, where I link to other Weirichs I have found on the Web. Although the page is a lot of fun, keeping its links up to date can be very tedious. As web pages that I reference are moved or deleted, links to them become stale. Without constant checking, it is difficult to keep my links current.

So, I began to wonder, is there a way to automatically find the outdated links in a web page? What I needed was a script that would scan all of my web pages and report every bad HTML link along with the web page on which it was used.

There are several parts to this problem. Our script must be able to:

  • fetch a web document from the Web

  • extract a list of URLs from a web document

  • test a URL to see if it is valid

The LWP Library

We could write code by hand to extract URLs and validate them, but there is a much easier way. LWP is a Perl library (available from any CPAN archive site) designed to make accessing the World Wide Web easy. LWP uses Perl objects to provide Web-related services to a client. Perl objects are a recent addition to the Perl language, and many people might not be familiar with them.

Perl objects are references to “things” that know what class they belong to. These “things” are usually anonymous hashes but you don't need to know this to use an object. Classes are packages that provide the methods the object uses to implement its behavior. And finally, a method is a function (in the class package) that expects an object reference (or sometimes a package name) as its first argument.
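
To make this concrete, here is a tiny self-contained class (a hypothetical example, not part of LWP). Greeter is the class, new builds an object from an anonymous hash, and greet is a method that receives the object reference as its first argument:

package Greeter;

sub new {
    my ($class, $name) = @_;       # the package name arrives as the first argument
    my $self = { name => $name };  # the "thing" is an anonymous hash
    return bless $self, $class;    # bless ties the reference to its class
}

sub greet {
    my ($self) = @_;               # methods receive the object reference first
    print "Hello, $self->{name}!\n";
}

package main;
$g = new Greeter 'world';
$g->greet;                         # prints "Hello, world!"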

If this sounds confusing, don't worry. Using objects is very easy. LWP defines a class called HTTP::Request that represents a request to be sent on the Web. The request to GET a document at URL http://w3.one.net/~jweirich can be created with the statement:

$req = new HTTP::Request 'GET',
 'http://w3.one.net/~jweirich';

new creates a new Request object initialized with the GET and http://w3.one.net/~jweirich parameters. This new object is assigned to the $req variable.

Calling a member function of an object is equally straightforward. For example, if you want to examine the URL for this request, you can invoke the url method on this object.

print "The URL of this request is:
", $req->url, ",\n";

Notice that methods are invoked using the -> syntax. C++ programmers should feel comfortable with this.

Getting a Document

All the knowledge about fetching a document across the Web is stored in a UserAgent object. The UserAgent object knows how long to wait for responses, how to handle errors, and what to do with the document when it arrives. It does all the hard work—we just need to give it the right information so that it can do its job.

use LWP::UserAgent;
use HTTP::Request;

$agent = new LWP::UserAgent;        # the agent does the actual fetching
$req = new HTTP::Request ('GET',
 'http://w3.one.net/~jweirich/');
$agent->request ($req, \&callback); # fetch, handing the data to callback

This snippet of Perl code creates a UserAgent and a Request object. The request method of the UserAgent issues the request and calls the subroutine callback with each chunk of data from the arriving document. The callback subroutine may be called many times before the complete document has been received.
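
As a minimal sketch (assuming all we want is to accumulate the document into a single string), such a callback might look like this; LWP passes each chunk of data, the response object, and the protocol object as arguments:

$content = '';

sub callback {
    my ($chunk, $response, $protocol) = @_;  # arguments supplied by LWP
    $content .= $chunk;                      # collect the arriving data
}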

Parsing the Document

We could use regular expressions to parse the incoming document to determine the location of all the links, but when you begin to consider that HTML tags may be broken across several lines and all the little variations involved, it becomes a more difficult task. Fortunately, there is an HTML parsing object available in the LWP library, called HTML::LinkExtor, which extracts all the links from an HTML document.

The parser is created and then fed pieces of the document until it reaches the end of the document. Whenever the parser detects links buried in HTML tags, it calls another callback subroutine that we provide. Here is an example that extracts and prints all the links in a document.

use HTML::LinkExtor;

$parser = new HTML::LinkExtor (\&LinkCallback);
$parser->parse ($chunk);   # feed the parser a chunk of the document
$parser->parse ($chunk);   # (repeat for as many chunks as arrive)
$parser->parse ($chunk);
$parser->eof;              # tell the parser there is no more input

sub LinkCallback {
    my ($tag, %links) = @_;
    print join ("\n", values %links), "\n";
}
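
With fetching and parsing in hand, the pieces suggest the shape of the final link checker. The following sketch (an illustration, not the finished script) feeds each arriving chunk straight into the parser, collects every link it reports, and then tests each link with a lightweight HEAD request, which retrieves only the headers. Note that relative links would need to be resolved against the page's base URL before testing.

use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;

my @links;
my $parser = new HTML::LinkExtor (sub {
    my ($tag, %attr) = @_;
    push @links, values %attr;     # remember every URL the parser reports
});

my $agent = new LWP::UserAgent;
my $req = new HTTP::Request ('GET', 'http://w3.one.net/~jweirich/');
$agent->request ($req, sub { $parser->parse ($_[0]); });
$parser->eof;

for my $url (@links) {
    my $res = $agent->request (new HTTP::Request ('HEAD', $url));
    my $status = $res->is_success ? "ok " : "BAD";
    print "$status $url\n";
}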