Using Perl to Check Web Links
We now have all the tools we need to build our checklinks script. We will define two operations for URLs. When we scan a URL, we will fetch the document (using a UserAgent) and scan it for internal HTML links. Every new link we find will be added to a list of URLs to be checked.
Next, check a link to see if it points to a valid web document. We could try retrieving the entire document to see if the document exists, but the HTTP protocol defines a HEAD request that gets only the document's date, length and a few other attributes. Since a HEAD request can be much faster than a full GET for large documents, and since it tells us what we need to know, we will use the head() function of the LWP::Simple package to check a URL. If head() returns an undefined value, then the document specified by the URL cannot be fetched and we add the URL to a list of bad URLs. If head() returns a list, the URL is valid and it is added to the list of good URLs. Finally, if the valid URL points to a page in our local web space and ends with “.html” or “.htm”, we add the URL to a list of URLs to be scanned.
The scanning process produces more URLs to be checked. Checking these URLS produces more URLs that need to be scanned. As URLs are checked, they are moved to the good or bad list. Since we restrict scanning to URLs in our local web space, eventually we will scan all local URLs that are reachable from our starting document.
When there are no more URLs to be scanned and all URLs have been checked, we can print the list of bad URLs and the list of files that contain them.
The complete code to checklinks is found in Listing 1. You will need Perl 5 to be able to run the checklinks routine. You will also need a recent copy of the LWP library. When I installed LWP, I also had to update the IO and Net modules. You can find Perl, and the LWP, IO and Net modules at http://www.perl.com/perl.
You can invoke checklinks on a single URL with the command:
If you wish to scan all local URLs reachable from the main URL, add a -r option.
Running checklinks on my home system against my entire set of web pages took about 13 minutes to complete. Most of that time was spent waiting for the bad URLs to timeout. It scanned 76 pages, checked 289 URLs, and found 31 links that were no longer valid. Now all I have to do is find the time to clean up my web pages!
Jim Weirich is a software consultant for Compuware specializing in Unix and C++. When he is not working on his web pages, you can find him playing guitar, playing with his kids, or playing with Linux. Comments are welcome at email@example.com or visit http://w3.one.net/~jweirich.
- VMware's Clarity Design System
- My Childhood in a Cigar Box
- Applied Expert Systems, Inc.'s CleverView for TCP/IP on Linux
- Let's Go to Mars with Martian Lander
- Papa's Got a Brand New NAS
- Jetico's BestCrypt Container Encryption for Linux
- Debugging Democracy
- Tech Tip: Really Simple HTTP Server with Python
- Fancy Tricks for Changing Numeric Base
- Smith Charts for All