More About Searching

Learning to be more efficient in searching by pre-indexing files.
Traversing the Web Hierarchy

Now that we know how to index the words from a string, we will use File::Find to perform this task on the entire web hierarchy. File::Find comes with the standard Perl distribution, and exports a subroutine named find. This subroutine takes a subroutine reference and a list of directories as arguments. It invokes the named subroutine (known as a “callback”) on every file it finds in the named directories. It is the subroutine's responsibility to open the file, retrieve its contents, and index it as necessary.

Listing 3 (see Resources) contains a simple program (index-files.pl) that traverses the web hierarchy, searches through the contents, and stores word frequency in the %index hash. The callback subroutine ignores files in any directory named .noindex, as well as any subdirectories of such directories. This is accomplished with another hash named %ignore_directory. It also ignores non-HTML files and removes HTML tags from the contents of each file before indexing the words it contains. Rather than removing the HTML tags altogether, index-files.pl turns them into whitespace. This means that two words separated by an HTML tag (but no whitespace) will not be jammed together, but will remain separate words.

To avoid problems with HTML entities such as >, & and other items needed for symbols and accented Latin characters, we use the HTML::Entities module. This module exports a decode_entities subroutine that turns the entities back into their regular characters. By running this and then removing the HTML tags, we are able to index a clean document containing only text.

While hashes are convenient and fast, they tend to consume a lot of memory. A hash that points to another hash consumes even more memory, and storing each individual word as a key in the primary hash means the memory usage will skyrocket even more. Running index-file.pl is thus not for the faint of heart or for computers without a significant amount of RAM. This particular version of index-file.pl caused my Linux system (with 72MB of RAM and another 64MB of swap) to run out of memory, particularly when piping the output through less.

Writing %index to Disk

While index-files.pl performs an admirable job of indexing the files in a web hierarchy, it does not write the information to disk. Once index-files.pl exits, all indexing information is lost forever.

The solution is clearly to write the index to disk. But how? The most obvious answer is to use a DBM file. DBM files are most easily described as an on-disk version of hashes. Each entry in a DBM file has a key and a value, both of which may be any character string. As with hashes, DBM files are optimized for speed, making them much faster to use than straight text files. There are several different kinds of DBM files out there, ranging from plain old DBM to the GNU Project's GDBM to Berkeley DB, which has gained quite a bit of popularity in the last few years partly as a result of its integration into Sendmail.

Perl supports many types of DBM files through its “tie” interface, each of which has its own object module. Tying is a means of making a variable magical, attaching additional behavior every time a value is stored or retrieved. Thus, tying a hash to a DBM class means that every time a value is stored to the hash, the value is written to a corresponding DBM file on disk. By the same token, every value retrieval reads the information from disk.

There is one problem here, though: DBM files can store text strings, but nothing more complicated than that. Because each value of %index is a reference to another hash, the normal DBM classes will not work correctly. Attempting to store references in a normal DBM file will actually store their printed representation, which is not at all useful.

The solution is to use the MLDBM class, which is designed precisely for this purpose. MLDBM works in conjunction with another DBM class to store references in a format that can be retrieved later.

To use MLDBM, import it at the top of a program with use, specifying the type of DBM file to use with it:

use MLDBM qw(DB_File);

In this particular case, we will use DB_File, which uses the Berkeley DB module that comes with Perl. (Most Linux systems come with Berkeley DB, as well.) It is also possible to specify the underlying method used to store the references; we will use the default, which takes advantage of the Data::Dumper module available on CPAN. Other options include FreezeThaw and Storable, which perform similar tasks in different ways.

Our hash can then be tied to a DBM file with the tie command:

# In what file should the index be stored?
my $index_file = "/tmp/lj.db";
# Create the word index hash
my %data;
tie %data, "MLDBM", $index_file, O_RDWR|O_CREAT
or die "Error tying %data: $! ";

From this point onward, any value stored to %index will be stored in the file /tmp/lj.db. Any value retrieved from %index will actually be retrieved from /tmp/lj.db. Storing index data in /tmp is a mistake on a production web server, since the file can be deleted by the system.

While MLDBM attempts to function as transparently as possible, it will sometimes cause confusing errors. For example, there is normally no problem with the following Perl code:

$data{$word}->{$url}++;

As we saw earlier, this will increment the counter for a particular word in a particular file. When %data is tied to MLDBM, the above code will no longer work. Instead, the assignment must be performed in two stages, retrieving the hash reference, assigning to it, and placing it back inside of %data:

my $hashref = $data{$word};
$hashref->{$url}++;
$data{$word} = $hashref;
The index should be regenerated on a regular basis to ensure it is as fresh as possible. Consider using cron, which comes with all Linux and UNIX systems, to run index-files.pl every day at 4 AM or at another convenient time when people are unlikely to be using the computer.

While our version of index-files.pl does not support such functionality, it would not be difficult to add a hash key indicating when the file system was last indexed. index-files.pl could then be changed to index only those files that were created or modified after a particular date.

______________________

White Paper
Linux Management with Red Hat Satellite: Measuring Business Impact and ROI

Linux has become a key foundation for supporting today's rapidly growing IT environments. Linux is being used to deploy business applications and databases, trading on its reputation as a low-cost operating environment. For many IT organizations, Linux is a mainstay for deploying Web servers and has evolved from handling basic file, print, and utility workloads to running mission-critical applications and databases, physically, virtually, and in the cloud. As Linux grows in importance in terms of value to the business, managing Linux environments to high standards of service quality — availability, security, and performance — becomes an essential requirement for business success.

Learn More

Sponsored by Red Hat

White Paper
Private PaaS for the Agile Enterprise

If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.

Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.

Learn More

Sponsored by ActiveState