A Web Crawler in Perl
How might we use the spider program, other than as a curiosity? One use for the program would be as a replacement for one of the web site index and query programs like Harvest (http://harvest.cs.colorado.edu/Harvest/) or Excite for Web Servers (http://www.excite.com/navigate/prodinfo.html). These programs are large and complicated. They often provide the functionality of the Perl spider program, a means of archiving the text retrieved and a CGI query engine to run against the resulting database. Ongoing maintenance is required, since the query engine runs against the database rather than against the actual site content; therefore, the database must be regenerated whenever a change is made to the content of the site.
Some search engines, such as Excite for Web Servers, cannot index the content at a remote site. These engines build their database from the files which make up the web site, rather than from data retrieved across a network. If you had two web sites whose content was to appear in a single search application, these tools would not be appropriate. Furthermore, the Linux version of Excite for Web Servers is still in the “coming soon” stage.
Listing 2 and Listing 3 show a simple CGI search engine that is implemented using the spider.pl program. Listing 2 is an HTML form which calls spiderfind.cgi to process its input. Listing 3 is spiderfind.cgi. It first uses Brigitte Jellinek's library to move the data entered in the form into an associative array. It then calls the spider.pl program using the Perl system() function and passes the form data as parameters. Finally, it converts the output from spider.pl into a series of HTML links. The user's browser will display a list of hyperlinked URLs in which the search text was found. Note that the name of the host to search is specified by a hidden field in the HTML document. There are better and more security-conscious ways for two Perl programs to interact than through a Perl system() call, but I wanted to use an unmodified copy of spider.pl for this demonstration.
This script doesn't provide the complete functionality of the packages mentioned above, and it won't perform as well. Since we're doing the search against web server documents across the Net, we don't have the advantage of index files; therefore, the search will be slower and more processor-intensive. However, this script is easy to install and easier to maintain than those engines.
Another application that could be built using the spider.pl program is a broken link scanner for the Web. The HTTP response we showed previously began with the line “HTTP/1.0 200 OK”, indicating the request could be fulfilled. If we tried to hit a URL with a non-existent document, we would get the line “HTTP/1.0 404 Not found” instead. We could use this as an indication that the document does not exist and print the URL which referenced this page.
The modifications to the spider program needed to accomplish this are minor. Every time a hyperlink's URL is added to the URL queue, we also record the URL of the document in which we found the hyperlink. Then, when the spider checks out the hyperlink and receives a “404 Not found” response, it outputs the URL of the referring page.
- Brent Laster's Professional Git (Wrox)
- Own Your DNS Data
- Machine Learning Everywhere
- Smoothwall Express
- Bash Shell Script: Building a Better March Madness Bracket
- Simple Server Hardening
- From vs. to + for Microsoft and Linux
- Understanding OpenStack's Success
- The Weather Outside Is Frightful (Or Is It?)
- Ensono M.O.