A Web Crawler in Perl
How might we use the spider program, other than as a curiosity? One use for the program would be as a replacement for one of the web site index and query programs like Harvest (http://harvest.cs.colorado.edu/Harvest/) or Excite for Web Servers (http://www.excite.com/navigate/prodinfo.html). These programs are large and complicated. They often provide the functionality of the Perl spider program, a means of archiving the text retrieved and a CGI query engine to run against the resulting database. Ongoing maintenance is required, since the query engine runs against the database rather than against the actual site content; therefore, the database must be regenerated whenever a change is made to the content of the site.
Some search engines, such as Excite for Web Servers, cannot index the content at a remote site. These engines build their database from the files which make up the web site, rather than from data retrieved across a network. If you had two web sites whose content was to appear in a single search application, these tools would not be appropriate. Furthermore, the Linux version of Excite for Web Servers is still in the “coming soon” stage.
Listing 2 and Listing 3 show a simple CGI search engine that is implemented using the spider.pl program. Listing 2 is an HTML form which calls spiderfind.cgi to process its input. Listing 3 is spiderfind.cgi. It first uses Brigitte Jellinek's library to move the data entered in the form into an associative array. It then calls the spider.pl program using the Perl system() function and passes the form data as parameters. Finally, it converts the output from spider.pl into a series of HTML links. The user's browser will display a list of hyperlinked URLs in which the search text was found. Note that the name of the host to search is specified by a hidden field in the HTML document. There are better and more security-conscious ways for two Perl programs to interact than through a Perl system() call, but I wanted to use an unmodified copy of spider.pl for this demonstration.
This script doesn't provide the complete functionality of the packages mentioned above, and it won't perform as well. Since we're doing the search against web server documents across the Net, we don't have the advantage of index files; therefore, the search will be slower and more processor-intensive. However, this script is easy to install and easier to maintain than those engines.
Another application that could be built using the spider.pl program is a broken link scanner for the Web. The HTTP response we showed previously began with the line “HTTP/1.0 200 OK”, indicating the request could be fulfilled. If we tried to hit a URL with a non-existent document, we would get the line “HTTP/1.0 404 Not found” instead. We could use this as an indication that the document does not exist and print the URL which referenced this page.
The modifications to the spider program needed to accomplish this are minor. Every time a hyperlink's URL is added to the URL queue, we also record the URL of the document in which we found the hyperlink. Then, when the spider checks out the hyperlink and receives a “404 Not found” response, it outputs the URL of the referring page.
|Red Hat Enterprise Linux 7.1 beta available on IBM Power Platform||Jan 23, 2015|
|Designing with Linux||Jan 22, 2015|
|Wondershaper—QOS in a Pinch||Jan 21, 2015|
|Ideal Backups with zbackup||Jan 19, 2015|
|Non-Linux FOSS: Animation Made Easy||Jan 14, 2015|
|Internet of Things Blows Away CES, and it May Be Hunting for YOU Next||Jan 12, 2015|
- Designing with Linux
- Wondershaper—QOS in a Pinch
- Red Hat Enterprise Linux 7.1 beta available on IBM Power Platform
- Internet of Things Blows Away CES, and it May Be Hunting for YOU Next
- Ideal Backups with zbackup
- Slow System? iotop Is Your Friend
- Hats Off to Mozilla
- Non-Linux FOSS: Animation Made Easy
- diff -u: What's New in Kernel Development
- New Products
Editorial Advisory Panel
Thank you to our 2014 Editorial Advisors!
- Jeff Parent
- Brad Baillio
- Nick Baronian
- Steve Case
- Chadalavada Kalyana
- Caleb Cullen
- Keir Davis
- Michael Eager
- Nick Faltys
- Dennis Frey
- Philip Jacob
- Jay Kruizenga
- Steve Marquez
- Dave McAllister
- Craig Oda
- Mike Roberts
- Chris Stark
- Patrick Swartz
- David Lynch
- Alicia Gibb
- Thomas Quinlan
- Carson McDonald
- Kristen Shoemaker
- Charnell Luchich
- James Walker
- Victor Gregorio
- Hari Boukis
- Brian Conner
- David Lane