More About Searching

Learning to be more efficient in searching by pre-indexing files.
Reading the Index

Now that we have created an index, we will need a program to retrieve results from it. Indeed, that's just the beginning—if we want, we can write multiple programs that read through the DBM file, searching in a variety of ways.

Our first and simplest search will accept input from an HTML form. The form will allow users to specify a single word to search for. Unfortunately, the way we have indexed the files makes it easy to retrieve single words, but impossible to retrieve phrases. A different data structure would make it easier to perform such searches, but would undoubtedly increase the size of the index.

Listing 1

Listing 1 is a simple HTML form we can use for searching. This form expects to submit its results to a CGI program called, which is in Listing 4 (see Resources). opens the DBM file—again, using MLDBM and DB_File—and retrieves the keys from the hash associated with the search term. By sorting the keys (which contain URLs) by their values (which are integers indicating how many times a word appeared in a file), it's easy to create a simple frequency chart. In this particular case, displays all of the results, but it would not be difficult to display only the top 20.

Boolean Search Terms

Searching for a single term might be easy to implement, but the best search engines allow for AND and OR searches. Listing 2 is an HTML form that differs from the previous version by adding a radio button. If the radio button is set to the default OR position, any one of the search terms may appear in the document. An AND search requires that both or all of the search terms appear in the document in order for it to match.

Listing 2

A version of, which allows for AND and OR searches, is in Listing 5 (see Resources).

If the user chooses OR, will have to find the highest total for a combination of hashes, rather than just one. A simple example would be the hashes %odds and %evens, as follows:

%odds =  (one => 1, three => 3, five => 5);
%evens = (two => 2, four => 4, six => 6);

We can turn this into a simple hash by mixing them together:

%numbers = (%odds, %evens);
This technique will not work in our case, because it is quite possible that both words will appear in the same file. If that happens, there will be duplicate keys—unlike %odds and %evens, where no key appears in both hashes. For our OR search to work, we will need to combine the values in a master hash:
foreach my $word (@terms)
my $frequency = $data{$word};
foreach my $filename (keys %$frequency)
$full_results{$filename} += $frequency->{$filename};
The above code is not very different from its predecessor in The main change is that it introduces a new hash, %full_results, where the keys are file names and the values reflect the total scores for each of the search terms. This algorithm means a file containing one copy of both search terms will be ranked the same as one in which the same search term appears twice.

To handle AND searches, we must keep track of how many search terms appear in each file. If a file contains the same number of search terms as are in @terms, the file is acceptable. If it contains fewer matches than @terms, the file should be dropped from consideration.

We can implement this by using the %match_counter hash, where the keys are file names and the values are integers representing the number of search terms contained in a file. By adding a single line to the foreach loop, we can keep track of how many search terms appear in each file:

$match_counter{$filename}++ if ($boolean eq "and");

Once we have exited from the foreach loop, but before we announce the results to the user, we prune file names from %full_results. Because the keys for %full_results are the same as for %match_counter, the code is relatively simple:

foreach my $filename (keys %full_results)
delete $full_results{$filename}
unless ($match_counter{$filename} == @terms);
Notice how we use delete to remove the hash element. Using undef on $full_results{$filename} would make the value undefined, but would leave the hash key unaffected. delete removes the key and its corresponding value from the hash completely.

Also notice how we can get the size of @terms simply by naming the array itself. The == forces scalar context on both of its operands, so there is no need to use the scalar keyword before @terms.


White Paper
Linux Management with Red Hat Satellite: Measuring Business Impact and ROI

Linux has become a key foundation for supporting today's rapidly growing IT environments. Linux is being used to deploy business applications and databases, trading on its reputation as a low-cost operating environment. For many IT organizations, Linux is a mainstay for deploying Web servers and has evolved from handling basic file, print, and utility workloads to running mission-critical applications and databases, physically, virtually, and in the cloud. As Linux grows in importance in terms of value to the business, managing Linux environments to high standards of service quality — availability, security, and performance — becomes an essential requirement for business success.

Learn More

Sponsored by Red Hat

White Paper
Private PaaS for the Agile Enterprise

If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.

Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.

Learn More

Sponsored by ActiveState