A Simple Search Engine

Searching your web site has never been easier: an introduction to search methods.

Searching via the Web

Now that we can search for a pattern through all files under a particular directory, let's connect this functionality to the Web, searching through all of the files under the HTTP server's document hierarchy. Such a program will need to receive only a pattern from the user, since the web hierarchy does not change very often.

Listing 6

Listing 6 is an HTML form that could be used to provide such input. The form submits its contents to simple-cgi-find.pl, the CGI program in Listing 7. Its single parameter, pattern, contains a Perl pattern to be compared with the contents of each file in the web hierarchy. The program returns a list of the documents that match the user's pattern.

Listing 7

Unfortunately, the version of File::Find that comes with Perl does not work with the -T flag, which turns on Perl's secure tainting mode. CGI programs should always be run with -T, which ensures data from outside sources is not used in potentially compromising ways. In this case, however, we cannot run our program with -T. File::Find relies on the fastcwd routine in the Cwd module, which cannot be run successfully with -T. For the time being, I suggest using these programs without -T, but when the next version of Perl is released, I strongly recommend upgrading in order to run CGI programs with full tainting enabled.

Our search subroutine, find_matches, has been modified slightly, so that its results will be more relevant for web users. The first thing it does is to make sure the file has an extension indicating it contains HTML-formatted text or plain text. This ensures that the search will not try to view graphics files, which can contain any characters:

return unless (m/\.html?$/i or m/\.te?xt$/i);

Some web sites mark HTML files with extensions of .htm (or .HTM), and their text files with .txt or .TXT rather than .text. The above pattern allows for all of these variations, ignoring case with the /i switch and ensuring the suffix comes at the end of the file name with the $ metacharacter.
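To see which names the filter accepts, here is a minimal sketch that wraps the same two patterns in a helper. The name is_searchable and the sample file names are assumptions made for illustration; they do not come from Listing 7.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a file name ends in .htm, .html, .txt or .text
sub is_searchable {
    my ($name) = @_;
    return ($name =~ m/\.html?$/i or $name =~ m/\.te?xt$/i) ? 1 : 0;
}

# Sample names, invented for this example
for my $name (qw(index.html NEWS.HTM notes.txt README.text logo.gif html.gif)) {
    printf "%-12s %s\n", $name, is_searchable($name) ? "searched" : "skipped";
}
```

The last sample, html.gif, is skipped because the $ anchor requires the suffix to end the name rather than merely appear somewhere inside it.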

After retrieving the contents of the current file, find_matches checks to see if $pattern can be found inside of $contents, which contains the document's contents. We surround $pattern with \b, which matches only at a word boundary, so that $pattern is found only as a whole word. This ensures that searching for “foo” will not match the word “food”, even though the former is a substring of the latter.
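The effect of \b can be checked with a short sketch. The helper matches_word and the sample strings are assumptions for illustration only; the match itself uses the same form of pattern as find_matches.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $pattern occurs in $contents as a whole word
sub matches_word {
    my ($contents, $pattern) = @_;
    return $contents =~ m|\b$pattern\b|is ? 1 : 0;
}

# Compare a plain match with a whole-word match on invented sample text
for my $contents ("a foo walks in", "bring more food", "Foo!") {
    my $plain = $contents =~ m|foo|is ? "yes" : "no";
    my $word  = matches_word($contents, "foo") ? "yes" : "no";
    print qq{"$contents": plain match $plain, whole-word match $word\n};
}
```

Only the second sample differs between the two columns: “food” contains “foo”, but there is no word boundary between the second “o” and the “d”.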

If a match is found, find_matches creates a URL by replacing $search_root with $url_origin in the file's path, which hides the HTML document hierarchy from outside users. It then prints the file name inside a hyperlink to that URL:

if ($contents =~ m|\b$pattern\b|ios) {
    # Build the file's path, then swap the file-system root for the URL root
    my $url = "$File::Find::dir/$filename";
    $url =~ s/$search_root/$url_origin/;
    print qq{<li><a href="$url">$filename</a>\n};
}

Improving on our Web Search

While simple-cgi-find.pl works, it does have a few problems. For starters, it fails to differentiate between HTML tags and actual content. Searching for “IMG” should match only documents in which that string appears outside of HTML tags, not every document containing an <IMG> tag. For this reason, we will modify our program to remove HTML tags from the contents of each file before searching.

Beginning Perl programmers often think that the best way to remove HTML tags is to remove anything between < and >, as in:

$contents =~ s|<.+>||g;

Since “.” tells Perl to match any character (other than a newline, unless the /s modifier is given) and “+” tells Perl to match one or more of the preceding character, the statement above looks like it tells Perl to remove all of the HTML tags. Unfortunately, this is not the case: the statement will remove everything between the first < and the final > on each line, including any text that sits between the tags. This is because Perl's patterns are “greedy”, and try to maximize the number of characters they match.

We can make “+” non-greedy and try to match only the minimum number of characters by placing a ? after it. For example:

$contents =~ s|<.+?>||g;
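The difference is easiest to see side by side. The sketch below runs both substitutions on a small, invented HTML fragment. The third variant, which adds the /s modifier so that “.” also matches newlines, is an addition of this example and not something the article's code uses; it is shown because a tag may be split across lines.

```perl
use strict;
use warnings;

# Invented sample: one line of inline tags, then a tag split across two lines
my $html = qq{<p>Search <b>this</b> text.</p>\n<img\n  src="x.gif">done\n};

(my $greedy = $html) =~ s|<.+>||g;     # first < through last > on a line
(my $lazy   = $html) =~ s|<.+?>||g;    # each tag that fits on one line
(my $lazy_s = $html) =~ s|<.+?>||gs;   # /s lets a tag span several lines

print "greedy:\n$greedy\nnon-greedy:\n$lazy\nnon-greedy with /s:\n$lazy_s";
```

The greedy version deletes the whole first line, text included. The non-greedy version keeps “Search this text.” but leaves the two-line <img> tag in place, and only the /s variant removes it as well. None of these handles a > inside an attribute value or a comment, so this remains a rough filter rather than an HTML parser.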

There is also the sticky issue of what to do if $pattern contains white space. Should it be considered as a search phrase containing one or more white-space characters? Or should it be considered several different words with an “or” or “and” search?

Listing 8

In this particular case, we can have our cake and eat it, too. By adding a set of radio buttons to the HTML form, we can allow the user to choose whether a search should be literal, require all search terms be found or require any one of the search terms be found.

Now we can modify our program to handle “phrase” searches (as we have been doing until now), “and” searches (in which all of the words must appear) and “or” searches (in which one or more of the words must appear).

To implement an “and” search, we break $pattern into its constituent words with Perl's split operator. We then count the number of words we must find and iterate over them, decrementing the count each time a word is found in $contents. If $count reaches 0, we can be sure all words appear:

elsif ($search_type eq "and") {
    my @words = split /\s+/, $pattern;
    my $count = scalar @words;
    foreach my $word (@words) {
        # Count down once for each word found
        $count-- if ($contents =~ m|\b$word\b|is);
    }
    # Zero means every word appeared
    unless ($count) {
        print qq{<li><a href="$url">$filename</a>\n};
        $total_matches++;
    }
}

An “or” search is even easier to implement: once again, we break apart $pattern across white space. If even one of the constituent words matches, we can immediately print the file name and hyperlink, and return from find_matches:

elsif ($search_type eq "or") {
    my @words = split /\s+/, $pattern;
    foreach my $word (@words) {
        if ($contents =~ m|\b$word\b|is) {
            print qq{<li><a href="$url">$filename</a>\n};
            $total_matches++;
            return;    # one matching word is enough
        }
    }
}

Finally, we should have some way of telling the user how many documents matched. We do this by creating a new variable, $total_matches, which is incremented each time a document matches (as seen in the above code fragments for “and” and “or” searches).

These improvements are incorporated into the search program called better-cgi-search.pl, in Listing 9, not printed here but contained in the archive file (see Resources).
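As a rough illustration of how these pieces fit together, the sketch below combines tag removal with the three search types in a single function. It is not the code from Listing 9: the name document_matches, its return convention and the sample page are all assumptions made for this example, and the /s modifier on the tag-stripping step is likewise an addition.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from Listing 9: returns 1 if $contents
# satisfies $pattern under $search_type ("phrase", "and" or "or"), else 0.
sub document_matches {
    my ($contents, $pattern, $search_type) = @_;

    # Remove HTML tags first; /s lets a tag span several lines
    $contents =~ s|<.+?>||gs;

    if ($search_type eq "phrase") {
        return $contents =~ m|\b$pattern\b|is ? 1 : 0;
    }

    # Split on white space, dropping any empty leading field
    my @words = grep { length } split /\s+/, $pattern;
    return 0 unless @words;

    # grep in scalar context counts how many words were found
    my $found = grep { $contents =~ m|\b$_\b|is } @words;

    return $found == @words ? 1 : 0 if $search_type eq "and";
    return $found > 0       ? 1 : 0 if $search_type eq "or";
    return 0;    # unknown search type
}

# Invented sample page
my $page = qq{<html><body><h1>Perl search</h1> <img src="perl.gif"> Find files fast.</body></html>};

print "and: ", document_matches($page, "fast perl", "and"), "\n";
print "or:  ", document_matches($page, "img gif",   "or"),  "\n";
```

The “and” search succeeds because both words appear in the visible text, while the “or” search fails because “img” and “gif” occur only inside a tag. Since tags are replaced with nothing, words separated only by a tag would be joined together; replacing each tag with a space is one possible variation.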

