Using Perl to Check Web Links

Do you have many links on your web pages? If so, you're probably finding that the pages at the ends of those links are disappearing faster than you can track them. Ah, but now you can get the computer to handle that for you.
Putting It Together

We now have all the tools we need to build our checklinks script. We will define two operations for URLs. When we scan a URL, we will fetch the document (using a UserAgent) and scan it for internal HTML links. Every new link we find will be added to a list of URLs to be checked.
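The link-extraction half of the scan step might look something like the sketch below. It is not taken from Listing 1. It assumes the page body has already been fetched into a string and uses the HTML::LinkExtor and URI modules; the name extract_links and the example.com URLs are made up for illustration.

```perl
use strict;
use warnings;
use HTML::LinkExtor;   # part of the HTML-Parser distribution
use URI;

# Hypothetical helper: return the absolute URLs of all <a href="..."> links
# found in $html, resolving relative links against $base.
sub extract_links {
    my ($html, $base) = @_;
    my @links;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        # Only anchors are of interest here; images, frames etc. are skipped.
        return unless $tag eq 'a' && defined $attr{href};
        my $uri = URI->new_abs($attr{href}, $base);
        $uri->fragment(undef);          # "page.html#top" names the same document
        push @links, $uri->as_string;
    });
    $parser->parse($html);
    $parser->eof;
    return @links;
}
```

Calling extract_links($html, 'http://www.example.com/a/index.html') on a page that links to "b.html#top" would yield http://www.example.com/a/b.html, ready to be added to the list of URLs to check.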

When we check a URL, we test whether it points to a valid web document. We could try retrieving the entire document to see if it exists, but HTTP defines a HEAD request that gets only the document's date, length and a few other attributes. Since a HEAD request can be much faster than a full GET for large documents, and since it tells us what we need to know, we will use the head() function of the LWP::Simple package to check a URL. If head() returns an undefined value, then the document specified by the URL cannot be fetched and we add the URL to a list of bad URLs. If head() returns a list, the URL is valid and it is added to the list of good URLs. Finally, if the valid URL points to a page in our local web space and ends with “.html” or “.htm”, we add the URL to a list of URLs to be scanned.
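As a rough illustration of the check step (again a sketch, not the code from Listing 1), the function below sorts one URL into the good or bad list and queues local HTML pages for scanning. The names check_url, is_scannable and $LOCAL_PREFIX are hypothetical, and the optional last argument, which lets you substitute another function for head(), is an addition made here so the logic can be tried without network access.

```perl
use strict;
use warnings;

# Assumption: everything under this prefix counts as "our local web space".
my $LOCAL_PREFIX = 'http://www.example.com/';

# True if a valid URL is local and looks like an HTML page.
sub is_scannable {
    my ($url) = @_;
    return index($url, $LOCAL_PREFIX) == 0 && $url =~ /\.html?\z/ ? 1 : 0;
}

# Check one URL and file it in the lists passed by reference.
# $head defaults to LWP::Simple::head, loaded only when needed.
sub check_url {
    my ($url, $good, $bad, $to_scan, $head) = @_;
    $head ||= do { require LWP::Simple; \&LWP::Simple::head };

    my @info = $head->($url);     # list context: empty list means failure
    if (!@info) {
        push @$bad, $url;
        return 0;
    }
    push @$good, $url;
    push @$to_scan, $url if is_scannable($url);
    return 1;
}
```

A call such as check_url($url, \@good, \@bad, \@to_scan) performs a real HEAD request. Note that head() is called in list context here, where a failure shows up as an empty list.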

The scanning process produces more URLs to be checked. Checking these URLs produces more URLs that need to be scanned. As URLs are checked, they are moved to the good or bad list. Since we restrict scanning to URLs in our local web space, eventually we will scan all local URLs that are reachable from our starting document.

When there are no more URLs to be scanned and all URLs have been checked, we can print the list of bad URLs and the list of files that contain them.
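The interplay between checking and scanning can be sketched as a simple work-list loop. This is one possible arrangement and may differ from Listing 1. The function name crawl and its three callback arguments are hypothetical: $check says whether a URL is valid, $scan returns the links found on a page, and $should_scan decides whether a valid URL is a local page worth scanning. The sketch also records which pages refer to each URL, so the final report can name the files that contain the bad links.

```perl
use strict;
use warnings;

sub crawl {
    my ($start, $check, $scan, $should_scan) = @_;
    my (@good, @bad, @to_scan, %referrers);
    my @to_check = ($start);
    my %seen     = ($start => 1);      # each URL is checked only once

    while (@to_check || @to_scan) {
        if (@to_check) {
            # Check step: move the URL to the good or bad list.
            my $url = shift @to_check;
            if ($check->($url)) {
                push @good, $url;
                push @to_scan, $url if $should_scan->($url);
            } else {
                push @bad, $url;
            }
        } else {
            # Scan step: collect links and queue the ones not yet seen.
            my $page = shift @to_scan;
            for my $link ($scan->($page)) {
                push @{ $referrers{$link} }, $page;
                push @to_check, $link unless $seen{$link}++;
            }
        }
    }
    return (\@good, \@bad, \%referrers);
}

# Print each bad URL followed by the pages that link to it.
sub report_bad {
    my ($bad, $referrers) = @_;
    for my $url (@$bad) {
        print "BAD: $url\n";
        print "    referenced from $_\n" for @{ $referrers->{$url} || [] };
    }
}
```

Because the %seen hash keeps any URL from being queued twice, the loop terminates even when pages link to each other in a cycle.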


The complete code for checklinks is found in Listing 1. You will need Perl 5 to run checklinks. You will also need a recent copy of the LWP library. When I installed LWP, I also had to update the IO and Net modules. You can find Perl, and the LWP, IO and Net modules at

You can invoke checklinks on a single URL with the command:

checklinks url

If you wish to scan all local URLs reachable from the main URL, add a -r option.

Running checklinks on my home system against my entire set of web pages took about 13 minutes to complete. Most of that time was spent waiting for the bad URLs to time out. It scanned 76 pages, checked 289 URLs, and found 31 links that were no longer valid. Now all I have to do is find the time to clean up my web pages!

Jim Weirich is a software consultant for Compuware specializing in Unix and C++. When he is not working on his web pages, you can find him playing guitar, playing with his kids, or playing with Linux. Comments are welcome at or visit



did anyone have success

did anyone have success running Listing 1 code??

Use the FTP Luke

Get the version from the ftp server, it hasn't been HTMLized.

Mitch Frazier is an Associate Editor for Linux Journal.