A Simple Search Engine

Searching your web site has never been easier—an introduction to search methods.
Excluding Directories and Files

We now have a fairly full-featured search program which can handle most types of searches people want to do. The problem is that we have created a program which might be too good to be useful. Many clients of mine put information on their web sites before it is meant to be released. Without any links leading to these directories and documents, it is unlikely that anyone will stumble across them. However, our search program does not depend on hyperlinks to find documents, so it will happily return them in its results.

One common solution is for a search program to ignore any directory containing a file named .nosearch. This file does not need to contain any data, since its mere existence means a directory's contents will be skipped.

The easiest implementation would check for the existence of a .nosearch file in the directory currently being probed. However, checking for a file with each invocation of find_matches would reduce our program's already slow performance even more. It would be better if the program looked for a .nosearch file, then stored that information in a hash to be retrieved when future files in that directory are examined.

The Other Problem

We can solve these problems with two lines of code. The first, placed at the beginning of find_matches, returns immediately if a .nosearch file has already been found in the current directory:

return if ($ignore_directory{$File::Find::dir});

If we reach the second line, it means that the current directory has not yet been marked as ignorable. However, there are three circumstances in which it should be marked now: when we are examining the .nosearch file itself, when a .nosearch file exists in the current directory, or when a .nosearch file exists in the parent directory. After all, if the parent directory should not be searched, then neither should the child directory. Here is the code fragment that accomplishes this:

# Mark the directory as ignorable ...
$ignore_directory{$File::Find::dir} = 1
    if (($_ eq ".nosearch") ||
        (-e ".nosearch") ||
        (-e "../.nosearch"));

Listing 10 contains a version of better-cgi-search.pl with these additions and can be found in the archive file (see Resources).
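
To see the two lines working together outside of a CGI program, here is a minimal, self-contained sketch. It is not Listing 10: the temporary directory tree, the touch helper and the @searched array are invented for the demonstration, and only %ignore_directory and find_matches use the names from this article. The sketch also adds one assumption of its own, a second return after the marking step, so that the file which triggers the marking is itself skipped.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Build a small throwaway tree (hypothetical names, for illustration only).
my $root = tempdir(CLEANUP => 1);
make_path("$root/private/sub", "$root/public");

sub touch {
    my ($path) = @_;
    open(my $fh, '>', $path) or die "Cannot create $path: $!";
    close $fh;
}

touch("$root/$_") for ("top.html", "public/index.html",
                       "private/.nosearch", "private/draft.html",
                       "private/sub/deep.html");

my %ignore_directory;   # directories already known to be off limits
my @searched;           # stands in for the real pattern matching

sub find_matches {
    # Line one: leave at once if this directory was marked earlier.
    return if ($ignore_directory{$File::Find::dir});

    # Line two: mark the directory as ignorable ...
    $ignore_directory{$File::Find::dir} = 1
        if (($_ eq ".nosearch") ||
            (-e ".nosearch") ||
            (-e "../.nosearch"));

    # Assumption added for this sketch: skip the current file as well
    # if the directory has just been marked.
    return if ($ignore_directory{$File::Find::dir});

    push @searched, $File::Find::name if (-f $_);
}

find(\&find_matches, $root);

# Report the files that would have been searched, relative to $root.
my @relative = sort map { substr($_, length($root) + 1) } @searched;
print "$_\n" for @relative;
```

Run as is, this should print only public/index.html and top.html; nothing under private is examined. Note that the -e "../.nosearch" test looks just one level up, so a directory two or more levels below a .nosearch file would need an additional check, such as consulting %ignore_directory for each ancestor.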

Is This Any Way to Run a Search?

If you have already run these programs, you most likely found the main problem with the system outlined above: it is very slow. If your web site contains 100 files, this system works just fine. However, if your site expands to 1000 or 10,000 files, users will stop the search in the middle because it will take too long.

For this reason, most serious search engines employ a different strategy, one which separates searching into two stages. In the first stage, an indexing program takes the files apart and records which words appear in which files. A second program is then run as the search client, looking through the pregenerated index for matches instead of reading the files themselves.
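
As a rough illustration of the two stages, the sketch below builds a tiny in-memory index and then queries it. Everything in it is hypothetical: the %documents hash stands in for files on disk, and search_index is a name chosen for this example. A real indexer would read the files themselves and save the index to disk so the search client could load it later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the contents of three files on a web site.
my %documents = (
    "index.html"   => "Welcome to our Linux search page",
    "perl.html"    => "Perl makes text search simple",
    "contact.html" => "Send mail to the webmaster",
);

# Stage one: indexing. Map each lowercased word to the set of files
# in which it appears.
my %index;
for my $file (keys %documents) {
    for my $word (split /\W+/, lc $documents{$file}) {
        next unless length $word;
        $index{$word}{$file} = 1;
    }
}

# Stage two: searching. A query is now a single hash lookup rather
# than a pass over every file.
sub search_index {
    my ($word) = @_;
    return sort keys %{ $index{lc $word} || {} };
}

my @hits = search_index("search");
print "search: @hits\n";    # prints "search: index.html perl.html"

my @none = search_index("glimpse");
print "glimpse: ", scalar(@none), " matches\n";
```

The trade-off is that the index must be rebuilt whenever the files change, in exchange for queries whose cost no longer grows with the total size of the site.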

Next month, we will examine some ways of creating such indices, as well as how to look through them. Perhaps our simple search programs will not be able to compete with Glimpse and ht://Dig, but at least we will understand roughly how they work and what trade-offs are involved when writing search programs.

Reuven M. Lerner is an Internet and Web consultant living in Haifa, Israel, who has been using the Web since early 1993. His book Core Perl will be published by Prentice-Hall in the spring. Reuven can be reached at reuven@lerner.co.il. The ATF home page, including archives and discussion forums, is at http://www.lerner.co.il/atf/.