Using Perl to Check Web Links
We now have all the tools we need to build our checklinks script. We will define two operations on URLs: scanning and checking. When we scan a URL, we fetch the document (using a UserAgent) and search it for HTML links. Every new link we find is added to a list of URLs to be checked.
When we check a URL, we test whether it points to a valid web document. We could retrieve the entire document to see if it exists, but HTTP defines a HEAD request that returns only the document's date, length and a few other attributes. Since a HEAD request can be much faster than a full GET for large documents, and since it tells us what we need to know, we will use the head() function of the LWP::Simple package to check a URL. If head() returns an undefined value, the document specified by the URL cannot be fetched, and we add the URL to a list of bad URLs. If head() returns a list, the URL is valid, and we add it to the list of good URLs. Finally, if the valid URL points to a page in our local web space and ends with “.html” or “.htm”, we add the URL to a list of URLs to be scanned.
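The check operation described above can be sketched in a few lines. This is only an illustration, not the code from Listing 1: the names check_url, $local_prefix, %good, %bad and @to_scan are made up for the example, and the local prefix is an assumed value. The function takes the HEAD routine as a code reference so that it can be tried without a network connection.

```perl
use strict;
use warnings;

# Assumption: pages under this prefix count as "our local web space".
my $local_prefix = 'http://localhost/';

my (%good, %bad);   # URLs already checked, keyed by URL
my @to_scan;        # valid local HTML pages still to be scanned

# Check one URL. $head is a code reference that behaves like
# LWP::Simple::head(): it returns a list of header values on success
# and an empty list on failure.
sub check_url {
    my ($url, $head) = @_;

    my @info = $head->($url);
    if (!@info) {
        $bad{$url} = 1;          # could not be fetched
        return 0;
    }

    $good{$url} = 1;

    # Only local pages ending in .html or .htm are scanned further.
    if (index($url, $local_prefix) == 0 && $url =~ /\.html?$/) {
        push @to_scan, $url;
    }
    return 1;
}
```

With LWP installed, the real HEAD routine can be passed in after "use LWP::Simple qw(head);" by calling check_url($url, \&head). Passing a stub instead makes it easy to exercise the bookkeeping offline.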
The scanning process produces more URLs to be checked. Checking these URLs produces more URLs that need to be scanned. As URLs are checked, they are moved to the good or bad list. Since we restrict scanning to URLs in our local web space, eventually we will scan all local URLs that are reachable from our starting document.
When there are no more URLs to be scanned and all URLs have been checked, we can print the list of bad URLs and the list of files that contain them.
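The interplay between scanning and checking amounts to a work-list loop, which might look roughly like the sketch below. It is a simplified stand-in rather than the code in Listing 1. The name check_links and its parameters are hypothetical, the fetch and validity tests are passed in as code references, and a crude regular expression that only recognizes double-quoted absolute href values takes the place of a real HTML parser. It always follows local pages recursively.

```perl
use strict;
use warnings;

# $fetch->($url)    returns the page's HTML, or undef on failure.
# $is_valid->($url) returns true if the URL can be retrieved.
# Returns a hash reference: bad URL => list of pages that link to it.
sub check_links {
    my ($start, $prefix, $fetch, $is_valid) = @_;

    my %checked;              # URL => 1 (good) or 0 (bad)
    my %referrers;            # URL => { page linking to it => 1 }
    my @to_check = ($start);
    my @to_scan;

    while (@to_check || @to_scan) {
        # Check every pending URL exactly once.
        while (defined(my $url = shift @to_check)) {
            next if exists $checked{$url};
            $checked{$url} = $is_valid->($url) ? 1 : 0;
            push @to_scan, $url
                if $checked{$url}
                && index($url, $prefix) == 0
                && $url =~ /\.html?$/;
        }

        # Scan one page; its links become new URLs to check.
        if (defined(my $page = shift @to_scan)) {
            my $html = $fetch->($page);
            next unless defined $html;
            for my $link ($html =~ /href\s*=\s*"([^"]+)"/gi) {
                $referrers{$link}{$page} = 1;
                push @to_check, $link;
            }
        }
    }

    # Collect the bad URLs along with the pages that contain them.
    my %report;
    for my $url (grep { !$checked{$_} } keys %checked) {
        $report{$url} = [ sort keys %{ $referrers{$url} || {} } ];
    }
    return \%report;
}
```

The loop ends because each URL is checked at most once and a page is queued for scanning only when it is first checked. Printing the result is then a matter of walking the returned hash and listing each bad URL with its referring pages.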
The complete code for checklinks is found in Listing 1. You will need Perl 5 to run checklinks. You will also need a recent copy of the LWP library. When I installed LWP, I also had to update the IO and Net modules. You can find Perl, and the LWP, IO and Net modules at http://www.perl.com/perl.
You can invoke checklinks on a single URL with the command:
If you wish to scan all local URLs reachable from the main URL, add a -r option.
Running checklinks on my home system against my entire set of web pages took about 13 minutes to complete. Most of that time was spent waiting for the bad URLs to time out. It scanned 76 pages, checked 289 URLs, and found 31 links that were no longer valid. Now all I have to do is find the time to clean up my web pages!
Jim Weirich is a software consultant for Compuware specializing in Unix and C++. When he is not working on his web pages, you can find him playing guitar, playing with his kids, or playing with Linux. Comments are welcome at email@example.com or visit http://w3.one.net/~jweirich.