More About Searching

Learning to be more efficient in searching by pre-indexing files.
Traversing the Web Hierarchy

Now that we know how to index the words from a string, we will use File::Find to perform this task on the entire web hierarchy. File::Find comes with the standard Perl distribution, and exports a subroutine named find. This subroutine takes a subroutine reference and a list of directories as arguments. It invokes the named subroutine (known as a “callback”) on every file it finds in the named directories. It is the subroutine's responsibility to open the file, retrieve its contents, and index it as necessary.

Listing 3 (see Resources) contains a simple program (index-files.pl) that traverses the web hierarchy, searches through the contents, and stores word frequency in the %index hash. The callback subroutine ignores files in any directory named .noindex, as well as any subdirectories of such directories. This is accomplished with another hash named %ignore_directory. It also ignores non-HTML files and removes HTML tags from the contents of each file before indexing the words it contains. Rather than removing the HTML tags altogether, index-files.pl turns them into whitespace. This means that two words separated by an HTML tag (but no whitespace) will not be jammed together, but will remain separate words.

To avoid problems with HTML entities such as >, & and other items needed for symbols and accented Latin characters, we use the HTML::Entities module. This module exports a decode_entities subroutine that turns the entities back into their regular characters. By running this and then removing the HTML tags, we are able to index a clean document containing only text.

While hashes are convenient and fast, they tend to consume a lot of memory. A hash that points to another hash consumes even more memory, and storing each individual word as a key in the primary hash means the memory usage will skyrocket even more. Running index-file.pl is thus not for the faint of heart or for computers without a significant amount of RAM. This particular version of index-file.pl caused my Linux system (with 72MB of RAM and another 64MB of swap) to run out of memory, particularly when piping the output through less.

Writing %index to Disk

While index-files.pl performs an admirable job of indexing the files in a web hierarchy, it does not write the information to disk. Once index-files.pl exits, all indexing information is lost forever.

The solution is clearly to write the index to disk. But how? The most obvious answer is to use a DBM file. DBM files are most easily described as an on-disk version of hashes. Each entry in a DBM file has a key and a value, both of which may be any character string. As with hashes, DBM files are optimized for speed, making them much faster to use than straight text files. There are several different kinds of DBM files out there, ranging from plain old DBM to the GNU Project's GDBM to Berkeley DB, which has gained quite a bit of popularity in the last few years partly as a result of its integration into Sendmail.

Perl supports many types of DBM files through its “tie” interface, each of which has its own object module. Tying is a means of making a variable magical, attaching additional behavior every time a value is stored or retrieved. Thus, tying a hash to a DBM class means that every time a value is stored to the hash, the value is written to a corresponding DBM file on disk. By the same token, every value retrieval reads the information from disk.

There is one problem here, though: DBM files can store text strings, but nothing more complicated than that. Because each value of %index is a reference to another hash, the normal DBM classes will not work correctly. Attempting to store references in a normal DBM file will actually store their printed representation, which is not at all useful.

The solution is to use the MLDBM class, which is designed precisely for this purpose. MLDBM works in conjunction with another DBM class to store references in a format that can be retrieved later.

To use MLDBM, import it at the top of a program with use, specifying the type of DBM file to use with it:

use MLDBM qw(DB_File);

In this particular case, we will use DB_File, which uses the Berkeley DB module that comes with Perl. (Most Linux systems come with Berkeley DB, as well.) It is also possible to specify the underlying method used to store the references; we will use the default, which takes advantage of the Data::Dumper module available on CPAN. Other options include FreezeThaw and Storable, which perform similar tasks in different ways.

Our hash can then be tied to a DBM file with the tie command:

# In what file should the index be stored?
my $index_file = "/tmp/lj.db";
# Create the word index hash
my %data;
tie %data, "MLDBM", $index_file, O_RDWR|O_CREAT
or die "Error tying %data: $! ";

From this point onward, any value stored to %index will be stored in the file /tmp/lj.db. Any value retrieved from %index will actually be retrieved from /tmp/lj.db. Storing index data in /tmp is a mistake on a production web server, since the file can be deleted by the system.

While MLDBM attempts to function as transparently as possible, it will sometimes cause confusing errors. For example, there is normally no problem with the following Perl code:

$data{$word}->{$url}++;

As we saw earlier, this will increment the counter for a particular word in a particular file. When %data is tied to MLDBM, the above code will no longer work. Instead, the assignment must be performed in two stages, retrieving the hash reference, assigning to it, and placing it back inside of %data:

my $hashref = $data{$word};
$hashref->{$url}++;
$data{$word} = $hashref;
The index should be regenerated on a regular basis to ensure it is as fresh as possible. Consider using cron, which comes with all Linux and UNIX systems, to run index-files.pl every day at 4 AM or at another convenient time when people are unlikely to be using the computer.

While our version of index-files.pl does not support such functionality, it would not be difficult to add a hash key indicating when the file system was last indexed. index-files.pl could then be changed to index only those files that were created or modified after a particular date.

______________________

Webinar
One Click, Universal Protection: Implementing Centralized Security Policies on Linux Systems

As Linux continues to play an ever increasing role in corporate data centers and institutions, ensuring the integrity and protection of these systems must be a priority. With 60% of the world's websites and an increasing share of organization's mission-critical workloads running on Linux, failing to stop malware and other advanced threats on Linux can increasingly impact an organization's reputation and bottom line.

Learn More

Sponsored by Bit9

Webinar
Linux Backup and Recovery Webinar

Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.

In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.

Learn More

Sponsored by Storix