A Web Crawler in Perl
How might we use the spider program, other than as a curiosity? One use for the program would be as a replacement for one of the web site index and query programs like Harvest (http://harvest.cs.colorado.edu/Harvest/) or Excite for Web Servers (http://www.excite.com/navigate/prodinfo.html). These programs are large and complicated. They typically combine three things: the functionality of the Perl spider program, a means of archiving the retrieved text, and a CGI query engine that runs against the resulting database. Ongoing maintenance is required because the query engine runs against the database rather than against the actual site content; the database must therefore be regenerated whenever the content of the site changes.
Some search engines, such as Excite for Web Servers, cannot index the content at a remote site. These engines build their database from the files that make up the web site, rather than from data retrieved across a network. If you had two web sites whose content had to appear in a single search application, these tools would not be appropriate. Furthermore, the Linux version of Excite for Web Servers is still in the “coming soon” stage.
Listing 2 and Listing 3 show a simple CGI search engine that is implemented using the spider.pl program. Listing 2 is an HTML form which calls spiderfind.cgi to process its input. Listing 3 is spiderfind.cgi. It first uses Brigitte Jellinek's library to move the data entered in the form into an associative array. It then calls the spider.pl program using the Perl system() function and passes the form data as parameters. Finally, it converts the output from spider.pl into a series of HTML links. The user's browser will display a list of hyperlinked URLs in which the search text was found. Note that the name of the host to search is specified by a hidden field in the HTML document. There are better and more security-conscious ways for two Perl programs to interact than through a Perl system() call, but I wanted to use an unmodified copy of spider.pl for this demonstration.
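As a rough illustration of the more security-conscious approach mentioned above, the following sketch runs a program without involving a shell and converts its output into HTML links. It is not the code from Listing 3. It assumes that spider.pl prints one URL per line, and the helper names escape_html, urls_to_links and run_without_shell are hypothetical.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Turn lines of spider output (assumed: one URL per line) into list items.
sub urls_to_links {
    my @lines = @_;
    my @links;
    for my $line (@lines) {
        chomp(my $url = $line);
        next unless $url =~ m{^http://}i;    # skip anything that is not a URL
        my $safe = escape_html($url);
        push @links, qq{<li><a href="$safe">$safe</a></li>};
    }
    return @links;
}

# Run a command with the list form of open, so no shell is started and
# form data is never interpreted as shell syntax.
# Returns the lines the command wrote to standard output.
sub run_without_shell {
    my ($program, @args) = @_;
    open(my $fh, '-|', $program, @args)
        or die "Cannot run $program: $!";
    my @output = <$fh>;
    close($fh);
    return @output;
}
```

A CGI script could then pass the output of run_without_shell('./spider.pl', ...) to urls_to_links and print the result inside a list element. The arguments spider.pl expects are not shown here, so they are left to the caller.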
This script doesn't provide the complete functionality of the packages mentioned above, and it won't perform as well. Since we're doing the search against web server documents across the Net, we don't have the advantage of index files; therefore, the search will be slower and more processor-intensive. However, this script is easy to install and easier to maintain than those engines.
Another application that could be built using the spider.pl program is a broken link scanner for the Web. The HTTP response we showed previously began with the line “HTTP/1.0 200 OK”, indicating the request could be fulfilled. If we requested a URL for a non-existent document, we would get the line “HTTP/1.0 404 Not found” instead. We could take this as an indication that the document does not exist and print the URL of the page that referenced it.
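The check on the status line can be sketched as follows. This is a minimal example rather than code taken from spider.pl; the function names status_code and is_broken_link are hypothetical, and only the first line of the response is examined.

```perl
use strict;
use warnings;

# Parse the first line of an HTTP response, e.g. "HTTP/1.0 404 Not found".
# Returns the numeric status code, or undef if the line is not a status line.
sub status_code {
    my ($line) = @_;
    return $line =~ m{^HTTP/\d+\.\d+\s+(\d{3})\b} ? $1 : undef;
}

# True when the status line reports that the document does not exist.
sub is_broken_link {
    my ($line) = @_;
    my $code = status_code($line);
    return defined $code && $code == 404;
}
```

For example, is_broken_link("HTTP/1.0 404 Not found") is true, while is_broken_link("HTTP/1.0 200 OK") is false.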
The modifications to the spider program needed to accomplish this are minor. Every time a hyperlink's URL is added to the URL queue, we also record the URL of the document in which we found the hyperlink. Then, when the spider checks out the hyperlink and receives a “404 Not found” response, it outputs the URL of the referring page.
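One way to keep track of referring pages is a hash keyed by URL, filled in at the moment a link is queued. The sketch below assumes a simple array for the queue; the names @queue, %referrer, enqueue_link and report_if_broken are hypothetical and do not come from spider.pl.

```perl
use strict;
use warnings;

my @queue;       # URLs still to be visited
my %referrer;    # URL => page on which the link to it was first found

# Queue a hyperlink and remember which page referenced it.
sub enqueue_link {
    my ($url, $found_on) = @_;
    return if exists $referrer{$url};    # already queued or visited
    $referrer{$url} = $found_on;
    push @queue, $url;
    return;
}

# Return a report line when the response status is 404, undef otherwise.
sub report_if_broken {
    my ($url, $status) = @_;
    return undef unless $status == 404;
    return "$referrer{$url} links to missing document $url";
}
```

This version records only the first page on which each link was found. Reporting every referring page would require storing a list of pages per URL instead of a single value.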