A Web Crawler in Perl
How might we use the spider program, other than as a curiosity? One use for the program would be as a replacement for one of the web site index and query programs like Harvest (http://harvest.cs.colorado.edu/Harvest/) or Excite for Web Servers (http://www.excite.com/navigate/prodinfo.html). These programs are large and complicated. They often provide the functionality of the Perl spider program, a means of archiving the text retrieved and a CGI query engine to run against the resulting database. Ongoing maintenance is required, since the query engine runs against the database rather than against the actual site content; therefore, the database must be regenerated whenever a change is made to the content of the site.
Some search engines, such as Excite for Web Servers, cannot index the content at a remote site. These engines build their database from the files which make up the web site, rather than from data retrieved across a network. If you had two web sites whose content was to appear in a single search application, these tools would not be appropriate. Furthermore, the Linux version of Excite for Web Servers is still in the “coming soon” stage.
Listing 2 and Listing 3 show a simple CGI search engine that is implemented using the spider.pl program. Listing 2 is an HTML form which calls spiderfind.cgi to process its input. Listing 3 is spiderfind.cgi. It first uses Brigitte Jellinek's library to move the data entered in the form into an associative array. It then calls the spider.pl program using the Perl system() function and passes the form data as parameters. Finally, it converts the output from spider.pl into a series of HTML links. The user's browser will display a list of hyperlinked URLs in which the search text was found. Note that the name of the host to search is specified by a hidden field in the HTML document. There are better and more security-conscious ways for two Perl programs to interact than through a Perl system() call, but I wanted to use an unmodified copy of spider.pl for this demonstration.
This script doesn't provide the complete functionality of the packages mentioned above, and it won't perform as well. Since we're doing the search against web server documents across the Net, we don't have the advantage of index files; therefore, the search will be slower and more processor-intensive. However, this script is easy to install and easier to maintain than those engines.
Another application that could be built using the spider.pl program is a broken link scanner for the Web. The HTTP response we showed previously began with the line “HTTP/1.0 200 OK”, indicating the request could be fulfilled. If we tried to hit a URL with a non-existent document, we would get the line “HTTP/1.0 404 Not found” instead. We could use this as an indication that the document does not exist and print the URL which referenced this page.
The modifications to the spider program needed to accomplish this are minor. Every time a hyperlink's URL is added to the URL queue, we also record the URL of the document in which we found the hyperlink. Then, when the spider checks out the hyperlink and receives a “404 Not found” response, it outputs the URL of the referring page.
Win an iPhone 6
Enter to Win
|Geek Hide-away in Guatemala - Stay for Free!||Nov 26, 2015|
|Microsoft and Linux: True Romance or Toxic Love?||Nov 25, 2015|
|Non-Linux FOSS: Install Windows? Yeah, Open Source Can Do That.||Nov 24, 2015|
|Cipher Security: How to harden TLS and SSH||Nov 23, 2015|
|Web Stores Held Hostage||Nov 19, 2015|
|diff -u: What's New in Kernel Development||Nov 17, 2015|
- Microsoft and Linux: True Romance or Toxic Love?
- Cipher Security: How to harden TLS and SSH
- Non-Linux FOSS: Install Windows? Yeah, Open Source Can Do That.
- Geek Hide-away in Guatemala - Stay for Free!
- Web Stores Held Hostage
- Firefox's New Feature for Tighter Security
- It's a Bird. It's Another Bird!
- PuppetLabs Introduces Application Orchestration
- diff -u: What's New in Kernel Development
- IBM LinuxONE Provides New Options for Linux Deployment