Using Perl to Check Web Links
One of the first things I did when I got my first Internet account was put together my own set of web pages. The one I get the most comments about is called “Weirichs on the Web”, where I link to other Weirichs I have found on the Web. Although the page is a lot of fun, keeping its links up to date can be very tedious. As web pages that I reference are moved or deleted, links to them become stale. Without constant checking, it is difficult to keep my links current.
So, I began to wonder, is there a way to automatically find the outdated links in a web page? What I needed was a script that would scan all of my web pages and report every bad HTML link along with the web page on which it was used.
There are several parts to this problem. Our script must be able to:
fetch a web document from the Web
extract a list of URLs from a web document
test a URL to see if it is valid
We could write code by hand to extract URLs and validate them, but there is a much easier way. LWP is a Perl library (available from any CPAN archive site) designed to make accessing the World Wide Web very easy in Perl. LWP uses Perl objects to provide Web-related services to a client. Perl objects are a recent addition to the Perl language and many people might not be familiar with them.
Perl objects are references to “things” that know what class they belong to. These “things” are usually anonymous hashes but you don't need to know this to use an object. Classes are packages that provide the methods the object uses to implement its behavior. And finally, a method is a function (in the class package) that expects an object reference (or sometimes a package name) as its first argument.
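To make these terms concrete, here is a minimal sketch of a class, an object and a method. The Counter class and its count field are made up for illustration and are not part of LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical class for illustration only; it is not part of LWP.
package Counter;

# Constructor: the first argument is the package name ("Counter").
sub new {
    my ($class, $start) = @_;
    my $self = { count => $start };   # the "thing": an anonymous hash
    return bless $self, $class;       # bless records which class it belongs to
}

# Method: the first argument is the object reference itself.
sub increment {
    my ($self) = @_;
    return ++$self->{count};
}

package main;

my $counter = Counter->new(5);
$counter->increment;
print "Class: ", ref($counter), "\n";         # prints "Class: Counter"
print "Count: ", $counter->increment, "\n";   # prints "Count: 7"
```

Here Counter->new(5) is the arrow form of the call; it does the same job as writing new Counter (5).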
If this sounds confusing, don't worry. Using objects is very easy. LWP defines a class called HTTP::Request that represents a request to be sent on the Web. The request to GET a document at URL http://w3.one.net/~jweirich can be created with the statement:
$req = new HTTP::Request 'GET', 'http://w3.one.net/~jweirich';
new creates a new Request object initialized with the GET and http://w3.one.net/~jweirich parameters. This new object is assigned to the $req variable.
Calling a member function of an object is equally straightforward. For example, if you want to examine the URL for this request, you can invoke the url method on this object.
print "The URL of this request is: ", $req->url, ".\n";
Notice that methods are invoked using the -> syntax. C++ programmers should feel comfortable with this.
All the knowledge about fetching a document across the Web is stored in a UserAgent object. The UserAgent object knows how long to wait for responses, how to handle errors, and what to do with the document when it arrives. It does all the hard work—we just need to give it the right information so that it can do its job.
use LWP::UserAgent;
use HTTP::Request;
$agent = new LWP::UserAgent;
$req = new HTTP::Request ('GET',
'http://w3.one.net/~jweirich/');
$agent->request ($req, \&callback);
This snippet of Perl code creates a UserAgent and a Request object. The request method of UserAgent issues the request and calls a subroutine named callback with each chunk of data from the arriving document. The callback subroutine may be called many times before the complete document has been received.
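The callback itself is not shown above, so here is one way it might look. This is a sketch under assumptions: the $document and $calls variables are invented for the example, and a data: URL stands in for a real page so that the script can run without a network connection.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use URI;

my $document = '';   # accumulates the chunks as they arrive
my $calls    = 0;    # counts how often the callback ran

# LWP passes the chunk of data as the first argument
# (the response object and the protocol object follow it).
sub callback {
    my ($chunk, $response, $protocol) = @_;
    $calls++;
    $document .= $chunk;
}

# Assumption for this sketch: a data: URL stands in for a real page so
# the example runs offline. Use an http URL here to fetch a live page.
my $url = URI->new('data:');
$url->media_type('text/html');
$url->data('<a href="http://w3.one.net/~jweirich/">home</a>');

my $agent    = LWP::UserAgent->new;
my $req      = HTTP::Request->new('GET', $url);
my $response = $agent->request($req, \&callback);

if ($response->is_success) {
    print "Received ", length($document), " bytes in $calls call(s)\n";
} else {
    print "Request failed: ", $response->status_line, "\n";
}
```

Because the content is handed to the callback, the script builds up the document itself rather than reading it from the response object afterwards.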
We could use regular expressions to parse the incoming document and locate all the links, but once you consider that HTML tags may be broken across several lines, along with all the little variations involved, the task becomes much more difficult. Fortunately, there is an HTML parsing class available in the LWP library, called HTML::LinkExtor, which extracts all the links from an HTML document.
The parser is created and then fed pieces of the document until it reaches the end of the document. Whenever the parser detects links buried in HTML tags, it calls another callback subroutine that we provide. Here is an example that extracts and prints all the links in a document.
use HTML::LinkExtor;
$parser = new HTML::LinkExtor (\&LinkCallback);
$parser->parse ($chunk);
$parser->parse ($chunk);
$parser->parse ($chunk);
$parser->eof;
sub LinkCallback {
my ($tag, %links) = @_;
print join ("\n", values %links), "\n";
}
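The example above leaves $chunk undefined. The following sketch fills that gap with made-up HTML, deliberately split in the middle of a tag to mimic chunks arriving from the network. The @chunks and @found names and the sample markup are assumptions for illustration.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

my @found;   # every link URL the parser reports

# Called once per tag that carries links, with the tag name
# followed by attribute => URL pairs.
sub LinkCallback {
    my ($tag, %links) = @_;
    push @found, values %links;
}

# Made-up HTML, split in the middle of a tag to mimic network chunks.
my @chunks = (
    '<html><body><a hr',
    'ef="http://w3.one.net/~jweirich/">Home</a>',
    '<img src="logo.gif"></body></html>',
);

my $parser = HTML::LinkExtor->new(\&LinkCallback);
$parser->parse($_) for @chunks;
$parser->eof;

print "$_\n" for @found;
```

The parser holds on to an incomplete tag until the rest of it arrives, which is why the chunks can be fed in exactly as they are received. Note that image sources are reported as links too, not just anchors.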
Comments
did anyone have success running Listing 1 code??
Use the FTP Luke
Get the version from the ftp server, it hasn't been HTMLized.
Mitch Frazier is an Associate Editor for Linux Journal.