A Web Crawler in Perl

Here's how spiders search the Web collecting information for you.
Application of the spider.pl Script

How might we use the spider program, other than as a curiosity? One use would be as a replacement for one of the web site index and query programs such as Harvest (http://harvest.cs.colorado.edu/Harvest/) or Excite for Web Servers (http://www.excite.com/navigate/prodinfo.html). These programs are large and complicated. In addition to the functionality of the Perl spider program, they typically provide a means of archiving the retrieved text and a CGI query engine to run against the resulting database. Ongoing maintenance is required: since the query engine runs against the database rather than against the actual site content, the database must be regenerated whenever the content of the site changes.

Some search engines, such as Excite for Web Servers, cannot index the content at a remote site. These engines build their database from the files which make up the web site, rather than from data retrieved across a network. If you had two web sites whose content was to appear in a single search application, these tools would not be appropriate. Furthermore, the Linux version of Excite for Web Servers is still in the “coming soon” stage.

Listing 2 and Listing 3 show a simple CGI search engine that is implemented using the spider.pl program. Listing 2 is an HTML form which calls spiderfind.cgi to process its input. Listing 3 is spiderfind.cgi. It first uses Brigitte Jellinek's library to move the data entered in the form into an associative array. It then calls the spider.pl program using the Perl system() function and passes the form data as parameters. Finally, it converts the output from spider.pl into a series of HTML links. The user's browser will display a list of hyperlinked URLs in which the search text was found. Note that the name of the host to search is specified by a hidden field in the HTML document. There are better and more security-conscious ways for two Perl programs to interact than through a Perl system() call, but I wanted to use an unmodified copy of spider.pl for this demonstration.
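Since Listing 3 is not reproduced here, the following is a minimal sketch of its final step: turning spider.pl's output into a list of HTML links. The `urls_to_links` helper and the sample URLs are my own names for illustration, not taken from the listings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper mirroring spiderfind.cgi's last step: each line of
# spider.pl's output is assumed to be a URL in which the search text was
# found; wrap each one in an anchor tag so the user's browser displays a
# list of hyperlinked URLs.
sub urls_to_links {
    my @urls = @_;
    return map { qq{<a href="$_">$_</a><br>\n} } @urls;
}

# In the real CGI, the URLs would come from running spider.pl, something
# like the following (an assumed invocation, not verified against the
# listing):
#   my @urls = `./spider.pl "$host" "$search_text"`;
#   chomp @urls;

print urls_to_links("http://www.ssc.com/", "http://www.ssc.com/lj/");
```

A hash lookup or module such as CGI.pm would normally handle the form parsing; the sketch covers only the output conversion.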

This script doesn't provide the complete functionality of the packages mentioned above, and it won't perform as well. Since we're doing the search against web server documents across the Net, we don't have the advantage of index files; therefore, the search will be slower and more processor-intensive. However, this script is easy to install and easier to maintain than those engines.

Another application that could be built using the spider.pl program is a broken link scanner for the Web. The HTTP response we showed previously began with the line “HTTP/1.0 200 OK”, indicating the request could be fulfilled. If we tried to hit a URL with a non-existent document, we would get the line “HTTP/1.0 404 Not found” instead. We could use this as an indication that the document does not exist and print the URL which referenced this page.
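As a sketch of that check (the `status_code` helper is my own, not part of spider.pl): the response code is simply the second whitespace-separated field of the first line the server sends back.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract the numeric response code from an HTTP/1.0 status line,
# e.g. "HTTP/1.0 404 Not found" -> 404.
sub status_code {
    my ($status_line) = @_;
    my (undef, $code) = split ' ', $status_line;
    return $code;
}

print status_code("HTTP/1.0 200 OK"), "\n";            # prints 200
print status_code("HTTP/1.0 404 Not found"), "\n";     # prints 404
```

Any code other than 200 could be logged; 404 is the case of interest for a broken-link scanner.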

The modifications to the spider program needed to accomplish this are minor. Every time a hyperlink's URL is added to the URL queue, we also record the URL of the document in which we found the hyperlink. Then, when the spider checks out the hyperlink and receives a “404 Not found” response, it outputs the URL of the referring page.
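A minimal sketch of that bookkeeping (the variable and subroutine names are mine; spider.pl's actual queue variables may differ): store the referring page alongside each queued URL, then look it up when a fetch fails.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @queue;      # URLs waiting to be fetched
my %referrer;   # maps each queued URL to the page its hyperlink was found on

# Queue a hyperlink's URL, remembering which document referenced it.
sub enqueue {
    my ($url, $found_on) = @_;
    push @queue, $url;
    $referrer{$url} = $found_on;
}

# Called when a fetch comes back "404 Not found": report the broken
# link together with the page that referenced it.
sub report_broken {
    my ($url) = @_;
    return "Broken link: $url (referenced from $referrer{$url})\n";
}

enqueue("http://www.ssc.com/missing.html", "http://www.ssc.com/index.html");
print report_broken("http://www.ssc.com/missing.html");
```

One hash entry per URL suffices here; a scanner that reports every referrer of a broken link would store a list of pages per URL instead.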

Mike Thomas is an Internet application developer working for a consulting firm in Saskatchewan, Canada. Mike lives in Massachusetts and uses two Linux systems to telecommute 2000 miles to his job and to Graduate School at the University of Regina. He can be reached by e-mail at thomas@javanet.com.

______________________

Comments


code download

Bejjan

Does anyone have the source? The URL in the article doesn't work anymore.

/Jimmy

Search engine components

dhanesh mane

I am working on search engine architecture. When I search for basic search engine architecture, there are lots of different images and details available.

Still, I am looking for a generic search engine architecture, with details of each and every component of it. Please let me know about this.

where to download spider.pl


Is there anyone who has a working version of this script? Even though it doesn't throw any error, I don't get any results. Thanks ahead.
S.

error while running script


When I run this script on my Ubuntu system, I get the following error:

ERROR: Unknown host www.ssc.com.

Please, can anyone help me?


How to set proxy

Kinshuk Chandra

Mike, I tried to run the program but I got the error “unknown host”.
So please tell me how to set the proxy in your spider.pl.
Thanks in advance

The spider application

Norton Security

Thanks, Mike, for this great tutorial. However, when I try to access the spider.pl program from the location you provided, it fails, saying there's no such directory on the server. Could you please correct this?

Thanks,
SecurityBay

urgent

Lassaad

Good morning,

I would like to download your script spider.pl from http://www.javanet.com/~thomas/, but this website is not active. Please, if you can, send me this script at my mail ing.lassaad@hotmail.com or Skype ytlassaad.

Thank you very much...

I'm very sorry for my English; I speak French...
Good day.

Good job. foritmail3@hotmail

Forti

Good job.
foritmail3@hotmail.com
Thanks.

Good job, thanks.


Good job, thanks.

Re: A Web Crawler in Perl


Hi Mike;

It is the year 2002 and I just saw your spider.pl. I love it because it contains the rich technical information I was looking for. I know Perl, but not enough! I tried to run the program from Windows Me through Explorer; the program sends the "GET /$document HTTP/1.0" request, but it does not get the response back! Do I need to configure Explorer or do something else? By the way, you said the program runs on Linux! It is running all right with Me so far! Could you please tell me why I cannot get any response back?

Thanks a lot to people like you.

Personal:

I see you are telecommuting. I have just started doing this. If you need help, please feel free to see my site:

www.softek-inc.com

Regards

Beheen Trimble

SW Engineer

Kinda old


But still good.
Thanks.


./spider.pl


./spider.pl http://www.ssc.com/ "Linux Journal"
syntax error at ./spider.pl line 236, near ">>>>"
syntax error at ./spider.pl line 250, near "line)"
syntax error at ./spider.pl line 265, near "elsif"
syntax error at ./spider.pl line 269, near "else"
Execution of ./spider.pl aborted due to compilation errors.

Why do i get these errors?

I get the same errors


The formatting is so bad that it lost a "}" at the end of the file.
Change the ">>>>>>>=" to just ">=".

Unfortunately, the running version of this script still sucks.

Sucking Script


Yeah, well, this script may work, but you should really check your syntax before posting.

Like the guy above, I get the following errors:

syntax error at D:\Source\perl\TestArea\image_finder\igrab2.pl line 236, near ">>>>"
syntax error at D:\Source\perl\TestArea\image_finder\igrab2.pl line 250, near "line)"
syntax error at D:\Source\perl\TestArea\image_finder\igrab2.pl line 265, near "elsif"
syntax error at D:\Source\perl\TestArea\image_finder\igrab2.pl line 269, near "else"

Once those are resolved, I still get:

Missing right curly or square bracket at D:\Source\perl\TestArea\image_finder\igrab2.pl line 273, at end of line
syntax error at D:\Source\perl\TestArea\image_finder\igrab2.pl line 273, at EOF
Execution of D:\Source\perl\TestArea\image_finder\igrab2.pl aborted due to compilation errors.

Better luck next time.

EOF

Just add a "}" at the end; it's missing!
