Using Perl to Check Web Links
We now have all the tools we need to build our checklinks script. We will define two operations for URLs. When we scan a URL, we will fetch the document (using a UserAgent) and scan it for internal HTML links. Every new link we find will be added to a list of URLs to be checked.
Next, check a link to see if it points to a valid web document. We could try retrieving the entire document to see if the document exists, but the HTTP protocol defines a HEAD request that gets only the document's date, length and a few other attributes. Since a HEAD request can be much faster than a full GET for large documents, and since it tells us what we need to know, we will use the head() function of the LWP::Simple package to check a URL. If head() returns an undefined value, then the document specified by the URL cannot be fetched and we add the URL to a list of bad URLs. If head() returns a list, the URL is valid and it is added to the list of good URLs. Finally, if the valid URL points to a page in our local web space and ends with “.html” or “.htm”, we add the URL to a list of URLs to be scanned.
The scanning process produces more URLs to be checked. Checking these URLS produces more URLs that need to be scanned. As URLs are checked, they are moved to the good or bad list. Since we restrict scanning to URLs in our local web space, eventually we will scan all local URLs that are reachable from our starting document.
When there are no more URLs to be scanned and all URLs have been checked, we can print the list of bad URLs and the list of files that contain them.
The complete code to checklinks is found in Listing 1. You will need Perl 5 to be able to run the checklinks routine. You will also need a recent copy of the LWP library. When I installed LWP, I also had to update the IO and Net modules. You can find Perl, and the LWP, IO and Net modules at http://www.perl.com/perl.
You can invoke checklinks on a single URL with the command:
If you wish to scan all local URLs reachable from the main URL, add a -r option.
Running checklinks on my home system against my entire set of web pages took about 13 minutes to complete. Most of that time was spent waiting for the bad URLs to timeout. It scanned 76 pages, checked 289 URLs, and found 31 links that were no longer valid. Now all I have to do is find the time to clean up my web pages!
Jim Weirich is a software consultant for Compuware specializing in Unix and C++. When he is not working on his web pages, you can find him playing guitar, playing with his kids, or playing with Linux. Comments are welcome at email@example.com or visit http://w3.one.net/~jweirich.
One Click, Universal Protection: Implementing Centralized Security Policies on Linux Systems
Join editor Bill Childers and Bit9's Paul Riegle on April 27 at 12pm Central to learn how to keep your Linux systems secure.
Free to Linux Journal readers.Register Now!
- New Products
- Security Hardening with Ansible
- diff -u: What's New in Kernel Development
- NSA: Linux Journal is an "extremist forum" and its readers get flagged for extra surveillance
- Monitoring Android Traffic with Wireshark
- New Products
- [<Megashare>] Watch Mrs Brown's Boys Movie Online Full Movie HD 2014
- Tech Tip: Really Simple HTTP Server with Python
- RSS Feeds
- ~Putlocker~2014 Watch Boyhood Online Streaming Full Movie