More About Searching

Learning to be more efficient in searching by pre-indexing files.
Reading the Index

Now that we have created an index, we will need a program to retrieve results from it. Indeed, that's just the beginning—if we want, we can write multiple programs that read through the DBM file, searching in a variety of ways.

Our first and simplest search will accept input from an HTML form. The form will allow users to specify a single word to search for. Unfortunately, the way we have indexed the files makes it easy to retrieve single words, but impossible to retrieve phrases. A different data structure would make it easier to perform such searches, but would undoubtedly increase the size of the index.

Listing 1

Listing 1 is a simple HTML form we can use for searching. This form expects to submit its results to a CGI program called dbsearch.pl, which is in Listing 4 (see Resources). dbsearch.pl opens the DBM file—again, using MLDBM and DB_File—and retrieves the keys from the hash associated with the search term. By sorting the keys (which contain URLs) by their values (which are integers indicating how many times a word appeared in a file), it's easy to create a simple frequency chart. In this particular case, dbsearch.pl displays all of the results, but it would not be difficult to display only the top 20.

Boolean Search Terms

Searching for a single term might be easy to implement, but the best search engines allow for AND and OR searches. Listing 2 is an HTML form that differs from the previous version by adding a radio button. If the radio button is set to the default OR position, any one of the search terms may appear in the document. An AND search requires that both or all of the search terms appear in the document in order for it to match.

Listing 2

A version of better-dbsearch.pl, which allows for AND and OR searches, is in Listing 5 (see Resources).

If the user chooses OR, dbsearch.pl will have to find the highest total for a combination of hashes, rather than just one. A simple example would be the hashes %odds and %evens, as follows:

%odds =  (one => 1, three => 3, five => 5);
%evens = (two => 2, four => 4, six => 6);

We can turn this into a simple hash by mixing them together:

%numbers = (%odds, %evens);
This technique will not work in our case, because it is quite possible that both words will appear in the same file. If that happens, there will be duplicate keys—unlike %odds and %evens, where no key appears in both hashes. For our OR search to work, we will need to combine the values in a master hash:
foreach my $word (@terms)
{
my $frequency = $data{$word};
foreach my $filename (keys %$frequency)
{
$full_results{$filename} += $frequency->{$filename};
}
}
The above code is not very different from its predecessor in dbsearch.pl. The main change is that it introduces a new hash, %full_results, where the keys are file names and the values reflect the total scores for each of the search terms. This algorithm means a file containing one copy of both search terms will be ranked the same as one in which the same search term appears twice.

To handle AND searches, we must keep track of how many search terms appear in each file. If a file contains the same number of search terms as are in @terms, the file is acceptable. If it contains fewer matches than @terms, the file should be dropped from consideration.

We can implement this by using the %match_counter hash, where the keys are file names and the values are integers representing the number of search terms contained in a file. By adding a single line to the foreach loop, we can keep track of how many search terms appear in each file:

$match_counter{$filename}++ if ($boolean eq "and");

Once we have exited from the foreach loop, but before we announce the results to the user, we prune file names from %full_results. Because the keys for %full_results are the same as for %match_counter, the code is relatively simple:

foreach my $filename (keys %full_results)
{
delete $full_results{$filename}
unless ($match_counter{$filename} == @terms);
}
Notice how we use delete to remove the hash element. Using undef on $full_results{$filename} would make the value undefined, but would leave the hash key unaffected. delete removes the key and its corresponding value from the hash completely.

Also notice how we can get the size of @terms simply by naming the array itself. The == forces scalar context on both of its operands, so there is no need to use the scalar keyword before @terms.

______________________

Webinar
One Click, Universal Protection: Implementing Centralized Security Policies on Linux Systems

As Linux continues to play an ever increasing role in corporate data centers and institutions, ensuring the integrity and protection of these systems must be a priority. With 60% of the world's websites and an increasing share of organization's mission-critical workloads running on Linux, failing to stop malware and other advanced threats on Linux can increasingly impact an organization's reputation and bottom line.

Learn More

Sponsored by Bit9

Webinar
Linux Backup and Recovery Webinar

Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.

In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.

Learn More

Sponsored by Storix