Using Perl to Check Web Links

Do you have many links on your web pages? If so, you're probably finding that the pages at the ends of those links are disappearing faster than you can track them. Ah, but now you can get the computer to handle that for you.

One of the first things I did when I got my first Internet account was put together my own set of web pages. The one I get the most comments about is called “Weirichs on the Web” where I link to other Weirichs I have found on the Web. Although a lot of fun, keeping the links up to date can be very tedious. As web pages that I reference are moved or deleted, links to them become stale. Without constant checking, it is difficult to keep my links current.

So, I began to wonder, is there a way to automatically find the outdated links in a web page? What I needed was a script that would scan all of my web pages and report every bad HTML link along with the web page on which it was used.

There are several parts to this problem. Our script must be able to:

  • fetch a web document from the Web

  • extract a list of URLs from a web document

  • test a URL to see if it is valid

The LWP Library

We could write code by hand to extract URLs and validate them, but there is a much easier way. LWP is a Perl library (available from any CPAN archive site) designed to make accessing the World Wide Web very easy in Perl. LWP uses Perl objects to provide Web-related services to a client. Perl objects are a recent addition to the Perl language and many people might not be familiar with them.

Perl objects are references to “things” that know what class they belong to. These “things” are usually anonymous hashes but you don't need to know this to use an object. Classes are packages that provide the methods the object uses to implement its behavior. And finally, a method is a function (in the class package) that expects an object reference (or sometimes a package name) as its first argument.
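For instance, a minimal class shows all three pieces at once: a package, a constructor that blesses an anonymous hash, and a method that expects the object reference as its first argument. (The Greeter class here is made up purely for illustration; it is not part of LWP.)

```perl
use strict;
use warnings;

package Greeter;                     # the class is just a package

sub new {                            # the constructor is a plain function
    my ($class, $name) = @_;
    my $self = { name => $name };    # the "thing": an anonymous hash
    return bless $self, $class;      # bless marks it as belonging to Greeter
}

sub greet {                          # a method: its first argument is
    my ($self) = @_;                 # the object reference itself
    return "Hello, $self->{name}!";
}

package main;

my $greeter = new Greeter ('Ada');   # same calling style the article uses
print $greeter->greet, "\n";         # prints "Hello, Ada!"
```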

If this sounds confusing, don't worry. Using objects is very easy. LWP defines a class called HTTP::Request that represents a request to be sent on the Web. The request to GET a document at URL http://w3.one.net/~jweirich can be created with the statement:

$req = new HTTP::Request ('GET',
 'http://w3.one.net/~jweirich');

new creates a new Request object initialized with the GET and http://w3.one.net/~jweirich parameters. This new object is assigned to the $req variable.

Calling a member function of an object is equally straightforward. For example, if you want to examine the URL for this request, you can invoke the url method on this object.

print "The URL of this request is: ", $req->url, "\n";

Notice that methods are invoked using the -> syntax. C++ programmers should feel comfortable with this.

Getting a Document

All the knowledge about fetching a document across the Web is stored in a UserAgent object. The UserAgent object knows how long to wait for responses, how to handle errors, and what to do with the document when it arrives. It does all the hard work—we just need to give it the right information so that it can do its job.

use LWP::UserAgent;
use HTTP::Request;
$agent = new LWP::UserAgent;
$req = new HTTP::Request ('GET',
 'http://w3.one.net/~jweirich/');
$agent->request ($req, \&callback);

This snippet of Perl code creates a UserAgent and a Request object. The request method of the UserAgent issues the request and calls a subroutine named callback with each chunk of data from the arriving document. The callback subroutine may be called many times before the complete document has been received.
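The body of callback itself is not shown above; a minimal version (the variable names are my own) simply accumulates the chunks into one string. LWP::UserAgent passes each chunk of data, the HTTP::Response object, and the protocol object as arguments:

```perl
use strict;
use warnings;

my $content = '';                    # grows as chunks arrive

# LWP::UserAgent calls this once per chunk of the arriving document.
sub callback {
    my ($data, $response, $protocol) = @_;
    $content .= $data;               # append this chunk to the rest
}

# After $agent->request ($req, \&callback) returns, $content
# holds the complete document.
```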

Parsing the Document

We could use regular expressions to parse the incoming document to determine the location of all the links, but when you begin to consider that HTML tags may be broken across several lines and all the little variations involved, it becomes a more difficult task. Fortunately, there is an HTML parsing object available in the LWP library, called HTML::LinkExtor, which extracts all the links from an HTML document.

The parser is created and then fed pieces of the document until it reaches the end of the document. Whenever the parser detects links buried in HTML tags, it calls another callback subroutine that we provide. Here is an example that extracts and prints all the links in a document.

use HTML::LinkExtor;

$parser = new HTML::LinkExtor (\&LinkCallback);
$parser->parse ($chunk);       # feed the document to the parser,
$parser->parse ($chunk);       # one chunk at a time
$parser->parse ($chunk);
$parser->eof;

sub LinkCallback {
    my ($tag, %links) = @_;    # tag name, then attribute => URL pairs
    print join ("\n", values %links), "\n";
}
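That covers the first two steps from the list at the top. The remaining step, testing whether a URL is still valid, can be sketched with a HEAD request, which asks the server for the headers only and skips the document body. The check_url name is my own; timeout and is_success are standard LWP::UserAgent and HTTP::Response methods.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Return true if the URL answers a HEAD request successfully.
# The simple one-argument form of request returns an
# HTTP::Response object rather than invoking a callback.
sub check_url {
    my ($url) = @_;
    my $agent = new LWP::UserAgent;
    $agent->timeout (30);                       # don't wait forever
    my $req = new HTTP::Request ('HEAD', $url);
    my $res = $agent->request ($req);
    return $res->is_success;
}

# e.g. check_url ('http://w3.one.net/~jweirich/')
# returns true as long as the page still answers.
```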