A Simple Search Engine
We now have a fairly full-functioned search program which can handle most types of searches people want to do. The problem is that we have created a program which might be too good to be useful. Many clients of mine put information on their web sites before it is meant to be released. Without any links leading to these directories and documents, it is unlikely someone will be able to find them. However, our search program does not depend on hyperlinks in order to find documents.
One common solution is for a search program to ignore any directory containing a file named .nosearch. This file does not need to contain any data, since its mere existence means a directory's contents will be skipped.
The easiest implementation would check for the existence of a .nosearch file in the directory currently being probed. However, checking for a file with each invocation of find_matches would reduce our program's already slow performance even more. It would be better if the program looked for a .nosearch file, then stored that information in a hash to be retrieved when future files in that directory are examined.
We can solve these problems with two lines of code. The first, placed at the beginning of find_matches, returns immediately if a .nosearch file has already been found in the current directory:
return if ($ignore_directory{$File::Find::dir});
If we reach the second line, it means that no .nosearch file has been found for this directory. However, there are several circumstances under which a .nosearch file wasn't found, yet should still be in force: when we are examining the .nosearch file itself, when a .nosearch file is in the directory or when a .nosearch file is in the parent directory. After all, if the parent directory should not be searched, then neither should the child directory. Here is the code fragment that accomplishes this:
# Mark the directory as ignorable ...
$ignore_directory{$File::Find::dir} = 1
if (($_ eq ".nosearch") ||
(-e ".nosearch") ||
(-e "../.nosearch"));
Listing 10 contains a version of better-cgi-search.pl with these
additions and can be found in the archive file (see Resources).
If you have already run these programs, you most likely found the main problem with the system outlined above: it is very slow. If your web site contains 100 files, this system works just fine. However, if your site expands to 1000 or 10,000 files, users will stop the search in the middle because it will take too long.
For this reason, most serious search engines employ a different strategy, one which separates the searching into two different stages. In the first stage, an indexing program takes the files apart, keeping track of where they might be. A second program is then run as a search client, looking through the pregenerated index for matches.
Next month, we will examine some ways of creating such indices, as well as how to look through them. Perhaps our simple search programs will not be able to complete with Glimpse and ht://Dig, but at least we will understand roughly how they work and what trade-offs are involved when writing search programs.

- « first
- ‹ previous
- 1
- 2
- 3
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Designing Electronics with Linux | May 22, 2013 |
| Dynamic DNS—an Object Lesson in Problem Solving | May 21, 2013 |
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
- New Products
- Linux Systems Administrator
- Senior Perl Developer
- UX Designer
- Technical Support Rep
- Web & UI Developer (JavaScript & j Query)
- Designing Electronics with Linux
- Dynamic DNS—an Object Lesson in Problem Solving
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Using Salt Stack and Vagrant for Drupal Development
Enter to Win an Adafruit Pi Cobbler Breakout Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Pi Cobbler Breakout Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- 5-21-13, Prototyping Pi Plate Kit: Philip Kirby
- Next winner announced on 5-27-13!
Featured Jobs
| Linux Systems Administrator | Houston and Austin, Texas | Host Gator |
| Senior Perl Developer | Austin, Texas | Host Gator |
| Technical Support Rep | Houston and Austin, Texas | Host Gator |
| UX Designer | Austin, Texas | Host Gator |
| Web & UI Developer (JavaScript & j Query) | Austin, Texas | Host Gator |
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?




10 hours 15 min ago
20 hours 56 min ago
1 day 2 hours ago
1 day 2 hours ago
1 day 4 hours ago
1 day 6 hours ago
1 day 13 hours ago
1 day 13 hours ago
1 day 15 hours ago
1 day 21 hours ago