More About Searching
Last month, we looked at a simple search engine for web sites. The program was little more than a CGI program strapped to the File::Find Perl module: each time a user would enter a search term in the HTML form, the search program would dutifully open and examine each of the files under the web hierarchy.
While this sort of search engine works, it is exceedingly inefficient. A site containing several dozen files will not suffer much when its documents are searched repeatedly by a CGI program, but a site with hundreds or thousands of files, attracting thousands of hits per day, will quickly see its server's load average skyrocket.
This month, we will explore ways of making a search engine more efficient. In the end, we will have a search engine which might not work as efficiently as other software, but is simple to install and use. Most importantly, we will get a chance to explore an interesting type of software with inner workings usually invisible to us.
Searching through files sequentially, trying to find matches for a user's input, is an inherently inefficient business. Each file must be opened, read, scanned and closed, which takes time. Perl programs tend to consume a fair amount of memory, so the slow execution speed means more copies of the CGI program will be running at once. This in turn increases the risk that the web server will have to use virtual memory, rather than physical RAM. Slow web servers make for unhappy users, and often convince users not to return at all.
To solve this problem, we must reduce or remove the need for the search program to read through files. If the CGI search program did not have to open each individual file, things would speed up quite a bit.
A tried-and-true solution is to divide the work of searching between two programs. Once or twice each day, an indexing program traverses the web-document tree, reading through each document and analyzing its word use. This program runs behind the scenes without user intervention or knowledge. Rather than sending its results to a user, the indexer writes everything it has learned about word frequency and usage to a file on disk.
This means the search program the user invokes via CGI does not actually have to search. Instead, the search program merely opens the index file, finds those files where the user's search term appears the greatest number of times, and displays that list in the user's browser.
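The division of labor described above can be sketched in a few lines. This is only an illustration under stated assumptions: the index is taken to be a hash mapping each word to a hash of file names and counts, the Storable module (part of the standard Perl distribution) stands in for whatever on-disk format a real indexer would use, and rank_files is a hypothetical helper name.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempfile);

# Assumed index layout: word => { filename => number of occurrences }.
my %index = (
    spain => { 'foo.html' => 5, 'bar.html' => 2 },
    rain  => { 'bar.html' => 1 },
);

# Indexer side: write the whole structure to disk in one step.
# A temporary file is used here purely for demonstration.
my ($fh, $index_file) = tempfile(UNLINK => 1);
nstore(\%index, $index_file);

# Search side (hypothetical helper): load the index and rank the files
# for one term, most occurrences first, ties broken by file name.
sub rank_files {
    my ($file, $term) = @_;
    my $loaded = retrieve($file);
    my $counts = $loaded->{lc $term} || {};
    return sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b }
           keys %$counts;
}

print join(', ', rank_files($index_file, 'Spain')), "\n";
```

The search side never opens a document; it reads one file and sorts a handful of hash keys, which is where the speedup comes from.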
Indexing a page is not difficult in Perl, thanks to its rich support for regular expressions. The m// operator normally matches the regular expression between its delimiters. When invoked with the /g modifier and when operating in list context, it returns all matches it can find. Thus, in the expression
my $found = join ' ',
("encyclopedia" =~ m/[aeiou]/g);
print "$found\n";
the first statement finds all vowels in “encyclopedia” and returns them as a list to the caller. In this case, the caller is Perl's join operator, which pushes the elements together, separated by spaces. Executing the code above displays the following on the user's screen:
e o e i a
Using the built-in character class for non-whitespace characters, \S, we can apply the same algorithm in order to extract words from a text string. For example:
my $sentence = "The rain in Spain falls mainly\n\n on the plain.";
my $found = join '|', ($sentence =~ m/\b\S+\b/g);
print "$found\n";
The code above prints the following results:
The|rain|in|Spain|falls|mainly|on|the|plain
Notice how using \b (which matches a word boundary) means our program need not worry about newline characters, extra spaces or punctuation.
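One caveat is worth seeing in action: \b strips punctuation from the edges of a word, but punctuation inside a word, such as a hyphen or an apostrophe, stays part of the match. The following is a minimal sketch; the helper name words is hypothetical and simply wraps the same pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: return the tokens found by the same pattern.
sub words {
    my ($text) = @_;
    return $text =~ m/\b\S+\b/g;
}

# Quotes, commas and periods at the edges of words are dropped.
print join('|', words(qq{"Hello," she said.})), "\n";

# Hyphens and apostrophes inside a word are kept.
print join('|', words("a well-known fact")), "\n";
print join('|', words("don't stop")), "\n";
```

Whether "well-known" should count as one word or two is a design decision for the indexer; this pattern treats it as one.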
Indexers have to consider whether to keep case relevant. My personal preference is to ignore case: users do not necessarily remember how a word was capitalized, and ignoring case removes an obstacle to finding the desired text. We can thus turn all of the words into lowercase letters:
my $sentence = "The rain in Spain falls mainly\n\n on the plain.";
my $found = join '|', map {lc $_}
($sentence =~
m/\b\S+\b/g);
print "$found\n";
Indexers also have to consider which words are too common to be worth indexing. To keep track of them, we will use a hash, %IGNORE. The keys will be the words we wish to ignore when indexing. Any non-zero value will indicate that the word should be ignored:
my %IGNORE = ("the" => 1, "in" => 1, "on" => 1);
my $sentence = "The rain in Spain falls mainly\n\n on the plain.";
my $found = join '|',
grep {!$IGNORE{$_}}
map {lc $_} ($sentence =~ m/\b\S+\b/g);
print "$found\n";
Notice how we can stack different items together: m// returns a list, which is passed to map, which returns a list, which is fed to grep, which in turn feeds join, whose result is assigned to $found.
Finally, we will index the words by creating a hash (%index) in which the collected words are the keys. The value will be a hash reference, where the key is the name of the current file, and the value is the number of times this word appears in the file. In other words,
$index{spain}->{'foo.html'} = 5;
means the word “spain” appears in foo.html five times. Here is some code that performs the indexing in this way:
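my %IGNORE = ("the" => 1, "in" => 1, "on" => 1);
my %index;                  # word => { filename => count }
my $filename = "foo.html";  # example name of the file being indexed
my $sentence = "The rain in Spain falls mainly\n\n on the plain.";
my @found =
    grep {!$IGNORE{$_}} map {lc $_} ($sentence =~ m/\b\S+\b/g);
# Count each occurrence of each word in this file
foreach my $word (@found)
{
    $index{$word}->{$filename}++;
}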
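To see how the index grows across more than one document, the same steps can be wrapped in a subroutine and called once per file. This is a sketch only: index_text is a hypothetical helper, and the file names and text are made up for illustration rather than read from disk.

```perl
use strict;
use warnings;

my %IGNORE = map { $_ => 1 } qw(the in on);
my %index;    # word => { filename => count }

# Hypothetical helper: add one document's words to %index and
# return the number of words that were indexed.
sub index_text {
    my ($filename, $text) = @_;
    my @found =
        grep { !$IGNORE{$_} } map { lc $_ } ($text =~ m/\b\S+\b/g);
    $index{$_}->{$filename}++ foreach @found;
    return scalar @found;
}

index_text('foo.html', "The rain in Spain falls mainly on the plain.");
index_text('bar.html', "Spain, Spain, Spain: rain or shine.");

# Show each word with the files it appears in and the counts.
foreach my $word (sort keys %index) {
    my $files = $index{$word};
    print "${word}: ",
        join(', ', map { "$_=$files->{$_}" } sort keys %$files), "\n";
}
```

Because every word points at its own hash of file names, adding another document never disturbs the counts already recorded for earlier ones.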