More About Searching

Learning to be more efficient in searching by pre-indexing files.

Last month, we looked at a simple search engine for web sites. The program was little more than a CGI program strapped to the File::Find Perl module: each time a user would enter a search term in the HTML form, the search program would dutifully open and examine each of the files under the web hierarchy.

While this sort of search engine works, it is exceedingly inefficient. A site containing several dozen files will not feel too much of a hit when its documents are searched repeatedly by a CGI program, but a site with hundreds or thousands of files, attracting thousands of hits per day, will watch its server's load average skyrocket without very much difficulty.

This month, we will explore ways of making a search engine more efficient. In the end, we will have a search engine which might not work as efficiently as other software, but is simple to install and use. Most importantly, we will get a chance to explore an interesting type of software with inner workings usually invisible to us.

The Secret: Pre-indexing

Searching through files sequentially, trying to find matches for a user's input, is an inherently inefficient business. Each file must be opened, read, scanned and closed, which takes time. Perl programs tend to consume a fair amount of memory, so the slow execution speed means more copies of the CGI program will be running at once. This in turn increases the risk that the web server will have to use virtual memory, rather than physical RAM. Slow web servers make for unhappy users, and often convince users not to return at all.

To solve this problem, we must reduce or remove the need for the search program to read through files. If the CGI search program did not have to open each individual file, things would speed up quite a bit.

A tried-and-true solution is to divide it up between two programs. Once or twice each day, an indexing program traverses the web-document tree, reading through each document and analyzing its word use. This program runs behind the scenes without user intervention or knowledge. Rather than sending its results to a user, the indexer dumps all information it has about word frequency and usage and places it in a file on disk.

This means the search program the user invokes via CGI does not actually have to search. Instead, the search program merely opens the index file, finds those files where the user's search term appears the greatest number of times, and displays that list in the user's browser.

Indexing a page is not difficult in Perl, because of its rich library of regular expressions. The m// operator normally matches the regular expression between its delimiters. When invoked with the /g modifier and when operating in list context, it returns all matches it can find. Thus, in the expression

my $found = join ' ',
  ("encyclopedia" =~ m/[aeiou]/g);
print "$found\n";

the first statement finds all vowels in “encyclopedia” and returns them as a list to the caller. In this case, the caller is Perl's join operator, which pushes the elements together, separated by spaces. Executing the two lines of code above displays the following on the user's screen:

e o a e i a
Using the built-in character class for non-whitespace characters, \S, we can apply the same algorithm in order to extract words from a text string. For example:
my $sentence = "The rain in Spain falls mainly\n\n on the plain.";
my $found = join '|', ($sentence =~ m/\b\S+\b/g);
print "$found\n";
The code above prints the following results:
Notice how using \b (which matches a word boundary) means our program need not worry about newline characters, extra spaces or punctuation.

Indexers have to consider whether to keep case relevant. My personal preference is to ignore case, since users do not necessarily remember, and it also removes an obstacle to finding desired text. We can thus turn all of the words into lowercase letters:

my $sentence = "The rain in Spain falls mainly\n\n on the plain.";
my $found = join '|', map {lc $_}
   ($sentence =~
print "$found\n";
Storing the Index

To store index information, we will use a hash, %IGNORE. The keys will be words we wish to ignore when indexing. Any non-zero value will indicate this word should be ignored when indexing:

%IGNORE = ("the" => 1, "in" => 1, "on" => 1);
my $sentence = "The rain in Spain falls mainly\n\n on the plain.";
my $found = join '|',
   grep {!$IGNORE{$_}}
   map {lc $_} ($sentence =~ m/\b\S+\b/g);
print "$found\n";

Notice how we can stack different items together: m// returns a list, which is passed to map, which returns a list, which is fed to grep, which is in turn fed to join, and which is in turn assigned to $found.

Finally, we will index the words by creating a hash (%index) in which the collected words are the keys. The value will be a hash reference, where the key is the name of the current file, and the value is the number of times this word appears in the file. In other words,

$index{spain}->{foo.html} = 5;

means the word “spain” appears in foo.html five times. Here is some code that performs the indexing in this way:

%IGNORE = ("the" => 1, "in" => 1, "on" => 1);
my $sentence = "The rain in Spain falls mainly\n\n on the plain.";
my @found =
  grep {!$IGNORE{$_}} map {lc $_} ($sentence =~
    foreach my $word (@found)