A Simple Search Engine
The CGI (“common gateway interface”) standard was originally designed to allow users to run programs via the Web, which would otherwise be available only on the server. Thus, the first CGI programs were simple interfaces to grep and finger, which received their inputs from an HTML form and sent the HTML-formatted output to the user's browser.
CGI programs, and server-side programs in general, have become more sophisticated since then. However, one application is as useful now as it was in the past: the ability to search through a web site for documents containing a particular word or string.
While search sites (now called “portals”) make it possible to browse through a large collection of pages spread out over a number of servers, the CGI programs handling the search have an easier job. They have to go through files only on the local server, producing a list of URLs matching the user's request.
This month, we will look at how to implement several different types of search programs. While these programs might not compete successfully with ht://Dig and Webglimpse, they do offer some insight into how these sorts of programs work, and the trade-offs programmers must make when writing such software.
Perl has long been my favorite language for writing server-side programs. This is in no small part due to its strong text-handling capabilities. Perl comes with a rich regular-expression language that makes it easy to find one piece of text inside another.
For example, the following one-line program prints any line of test.txt containing the word “foo”:
perl -ne 'print if m/foo/' test.txt
The -n switch tells Perl not to print lines by default, and the -e switch allows us to insert a program between the single quotes ('). We instruct Perl to print any line in which the m// (match) operator finds the search string. We can accomplish the same thing inside of a program, as shown in Listing 1.
Of course, the above program searches for a single pattern (the string “foo”) inside of a single file (test.txt). We can generalize the program more by using an empty <>, rather than iterating over <FILE>. An empty <> iterates through each element of @ARGV (the array containing command-line arguments), assigning each one in turn to the scalar $ARGV. If there are no command-line arguments, then <> expects to receive input from the user. Listing 2 is a revised version of the above program, which searches through multiple files for the string “foo”. Notice how this version of the program prints the file name as well as the matching line. Since $_ already contains a newline character, we need not put one at the end of the print statement. Listing 2 could be rewritten in a single line of Perl with the following:
perl -ne 'print "$ARGV: $_" if m/foo/;' *
Finally, we can make our simple search a bit more sophisticated by allowing the user to name the pattern, as well as the files. Listing 3 takes the first command-line argument, removing it from @ARGV and putting it in $pattern. To tell Perl that $pattern will not change, and that it should compile the search pattern only a single time, we use m// with the /o option.
Thus, to search for the pattern f.[aeiou] in all of the files with a “txt” extension, we would use:
./simple-search-3.pl "f.[aeiou]" *.txt
Sure enough, every line containing an f, followed by any character, followed by a vowel is printed on the screen, preceded by a file name.
The above would be a good skeleton for our web-based search if all documents on a web site were stored in a single directory. However, the opposite is normally the case: most web sites put files in a number of different directories. A good search program must traverse the entire web hierarchy, searching through each file in each directory.
While we could certainly accomplish this ourselves, someone has already done it for us. File::Find, a package which comes with Perl, makes it possible to create a find-like program using Perl. File::Find exports the find subroutine, which takes a list of arguments. The first argument is a subroutine reference invoked once for each file encountered. The remaining arguments should be directory and file names, which File::Find will read in sequence until it gets to the end.
For example, Listing 4 is a short program that uses File::Find to print all of the file names in a particular directory. As you can see, File::Find exports the variable $File::Find::name which contains the current file name as well as the find subroutine. The current directory is stored in $File::Find::dir.
Listing 5 contains a version of simple-find-2.pl, which uses File::Find to search through all of the files under a given directory tree. As with many programs that use File::Find, the bulk of simple-find-2.pl is spent inside of find_matches, a subroutine called once for every file encountered under the directories passed in @ARGV. To find all files containing the pattern “f.[aeiou]” in directories under /home and /development, type:
./simple-find-2.pl "f.[aeiou]" /home /development
Line 11 of simple-find-2.pl is particularly important, in that it undefines $/, the variable that determines the end-of-line character. Normally, Perl's <> operator iterates through a file line by line, returning undef when the end is reached. However, we want to search across an entire file, since a pattern might have to extend across lines. By undefining $/, the line
my $contents = (<FILE>);puts the entire contents of the file handle FILE inside of $contents, rather than just one line.