Using Perl to Check Web Links
We now have all the tools we need to build our checklinks script. We will define two operations for URLs. When we scan a URL, we will fetch the document (using a UserAgent) and scan it for internal HTML links. Every new link we find will be added to a list of URLs to be checked.
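The link-extraction half of the scan operation can be sketched as follows. This is a minimal illustration, not the code from Listing 1: the helper name extract_links and the example.com URLs are made up, it assumes the HTML::LinkExtor and URI modules are installed, and it only looks at ordinary anchor links.

```perl
use strict;
use warnings;
use HTML::LinkExtor;    # part of the HTML::Parser distribution
use URI;

# Return the absolute URLs of all <a href="..."> links found in $html.
# $base is the URL the document was fetched from; relative links are
# resolved against it. (Hypothetical helper, for illustration only.)
sub extract_links {
    my ($html, $base) = @_;
    my @links;
    my $parser = HTML::LinkExtor->new(
        sub {
            my ($tag, %attr) = @_;
            return unless $tag eq 'a' && defined $attr{href};
            my $uri = URI->new_abs($attr{href}, $base);
            $uri->fragment(undef);    # page.html#top names the same document
            push @links, $uri->as_string;
        }
    );
    $parser->parse($html);
    $parser->eof;
    return @links;
}

# Usage: the HTML would normally come from the UserAgent's response.
my $html = '<a href="other.html#top">Other</a> <img src="pic.gif">';
print "$_\n" for extract_links($html, 'http://www.example.com/docs/index.html');
```

Resolving each link against the URL of the page it came from matters because most local links are relative, and the check operation needs a complete URL.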
When we check a URL, we test whether it points to a valid web document. We could try retrieving the entire document to see if it exists, but HTTP defines a HEAD request that returns only the document's date, length and a few other attributes. Since a HEAD request can be much faster than a full GET for large documents, and since it tells us what we need to know, we will use the head() function of the LWP::Simple package to check a URL. If head() returns an undefined value, then the document specified by the URL cannot be fetched and we add the URL to a list of bad URLs. If head() returns a list, the URL is valid and it is added to the list of good URLs. Finally, if the valid URL points to a page in our local web space and ends with “.html” or “.htm”, we add the URL to a list of URLs to be scanned.
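The check operation described above might look roughly like this. It is a sketch under assumptions rather than the code from Listing 1: the names check_url, is_scannable and $LOCAL_PREFIX are hypothetical, and "local web space" is taken to mean any URL that begins with a fixed prefix.

```perl
use strict;
use warnings;
use LWP::Simple qw(head);

# Assumption: everything under this prefix is our local web space.
my $LOCAL_PREFIX = 'http://www.example.com/';

my (@good, @bad, @to_scan);

# True if $url is local and ends with ".html" or ".htm".
sub is_scannable {
    my ($url) = @_;
    return index($url, $LOCAL_PREFIX) == 0 && $url =~ /\.html?\z/;
}

# Classify one URL as good or bad, and queue local pages for scanning.
sub check_url {
    my ($url) = @_;
    # In list context head() returns header values on success and an
    # empty list on failure, so @headers is empty for a dead link.
    my @headers = head($url);
    if (!@headers) {
        push @bad, $url;
        return 0;
    }
    push @good, $url;
    push @to_scan, $url if is_scannable($url);
    return 1;
}

# Usage (needs network access):
#   check_url('http://www.example.com/index.html');
```

Keeping the local-page test in its own function makes it easy to change what counts as local without touching the network code.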
The scanning process produces more URLs to be checked. Checking these URLs produces more URLs that need to be scanned. As URLs are checked, they are moved to the good or bad list. Since we restrict scanning to URLs in our local web space, eventually we will scan all local URLs that are reachable from our starting document.
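The way checking and scanning feed each other can be illustrated with two work lists. The sketch below is one possible arrangement, not necessarily how Listing 1 is organized: the function name crawl is hypothetical, and the check, scan and locality tests are passed in as code references so the loop can be tried against a made-up in-memory site instead of the network.

```perl
use strict;
use warnings;

# Drive the check/scan cycle until both work lists are empty.
#   $check->($url)    returns true if the URL is valid
#   $scan->($url)     returns the URLs linked from that page
#   $is_local->($url) returns true if the page should itself be scanned
sub crawl {
    my ($start, $check, $scan, $is_local) = @_;
    my @to_check = ($start);
    my (@to_scan, @good, @bad);
    my %seen = ($start => 1);    # each URL is queued for checking only once

    while (@to_check || @to_scan) {
        if (@to_check) {
            my $url = shift @to_check;
            if ($check->($url)) {
                push @good, $url;
                push @to_scan, $url if $is_local->($url);
            }
            else {
                push @bad, $url;
            }
        }
        else {
            my $page = shift @to_scan;
            push @to_check, grep { !$seen{$_}++ } $scan->($page);
        }
    }
    return (\@good, \@bad);
}

# A made-up two-page site with one external link and one dead link.
my %site = (
    'http://www.example.com/index.html' =>
        ['http://www.example.com/a.html', 'http://www.example.org/'],
    'http://www.example.com/a.html' =>
        ['http://www.example.com/index.html', 'http://www.example.com/gone.html'],
);
my %exists = map { $_ => 1 } keys(%site), 'http://www.example.org/';

my ($good, $bad) = crawl(
    'http://www.example.com/index.html',
    sub { $exists{ $_[0] } },
    sub { @{ $site{ $_[0] } || [] } },
    sub { exists $site{ $_[0] } },
);
print "good: @$good\n";
print "bad:  @$bad\n";
```

The %seen hash is what guarantees termination: pages that link back to each other are queued only once, so the loop ends when every reachable URL has been checked.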
When there are no more URLs to be scanned and all URLs have been checked, we can print the list of bad URLs and the list of files that contain them.
The complete code for checklinks is found in Listing 1. You will need Perl 5 to run the checklinks script. You will also need a recent copy of the LWP library. When I installed LWP, I also had to update the IO and Net modules. You can find Perl, and the LWP, IO and Net modules at http://www.perl.com/perl.
You can invoke checklinks on a single URL with the command:
If you wish to scan all local URLs reachable from the main URL, add a -r option.
Running checklinks on my home system against my entire set of web pages took about 13 minutes to complete. Most of that time was spent waiting for the bad URLs to time out. It scanned 76 pages, checked 289 URLs, and found 31 links that were no longer valid. Now all I have to do is find the time to clean up my web pages!
Jim Weirich is a software consultant for Compuware specializing in Unix and C++. When he is not working on his web pages, you can find him playing guitar, playing with his kids, or playing with Linux. Comments are welcome at firstname.lastname@example.org or visit http://w3.one.net/~jweirich.