A Simple Search Engine
The CGI (“common gateway interface”) standard was originally designed to allow users to run programs via the Web, which would otherwise be available only on the server. Thus, the first CGI programs were simple interfaces to grep and finger, which received their inputs from an HTML form and sent the HTML-formatted output to the user's browser.
CGI programs, and server-side programs in general, have become more sophisticated since then. However, one application is as useful now as it was in the past: the ability to search through a web site for documents containing a particular word or string.
While search sites (now called “portals”) make it possible to browse through a large collection of pages spread out over a number of servers, the CGI programs handling the search have an easier job. They have to go through files only on the local server, producing a list of URLs matching the user's request.
This month, we will look at how to implement several different types of search programs. While these programs might not compete successfully with ht://Dig and Webglimpse, they do offer some insight into how these sorts of programs work, and the trade-offs programmers must make when writing such software.
Perl has long been my favorite language for writing server-side programs. This is in no small part due to its strong text-handling capabilities. Perl comes with a rich regular-expression language that makes it easy to find one piece of text inside another.
For example, the following one-line program prints any line of test.txt containing the word “foo”:
perl -ne 'print if m/foo/' test.txt
The -n switch tells Perl not to print lines by default, and the -e switch allows us to insert a program between the single quotes ('). We instruct Perl to print any line in which the m// (match) operator finds the search string. We can accomplish the same thing inside of a program, as shown in Listing 1.
Of course, the above program searches for a single pattern (the string “foo”) inside of a single file (test.txt). We can generalize the program more by using an empty <>, rather than iterating over <FILE>. An empty <> iterates through each element of @ARGV (the array containing command-line arguments), assigning each one in turn to the scalar $ARGV. If there are no command-line arguments, then <> expects to receive input from the user. Listing 2 is a revised version of the above program, which searches through multiple files for the string “foo”. Notice how this version of the program prints the file name as well as the matching line. Since $_ already contains a newline character, we need not put one at the end of the print statement. Listing 2 could be rewritten in a single line of Perl with the following:
perl -ne 'print "$ARGV: $_" if m/foo/;' *
Finally, we can make our simple search a bit more sophisticated by allowing the user to name the pattern, as well as the files. Listing 3 takes the first command-line argument, removing it from @ARGV and putting it in $pattern. To tell Perl that $pattern will not change, and that it should compile the search pattern only a single time, we use m// with the /o option.
Thus, to search for the pattern f.[aeiou] in all of the files with a “txt” extension, we would use:
./simple-search-3.pl "f.[aeiou]" *.txt
Sure enough, every line containing an f, followed by any character, followed by a vowel is printed on the screen, preceded by a file name.
The above would be a good skeleton for our web-based search if all documents on a web site were stored in a single directory. However, the opposite is normally the case: most web sites put files in a number of different directories. A good search program must traverse the entire web hierarchy, searching through each file in each directory.
While we could certainly accomplish this ourselves, someone has already done it for us. File::Find, a package which comes with Perl, makes it possible to create a find-like program using Perl. File::Find exports the find subroutine, which takes a list of arguments. The first argument is a subroutine reference invoked once for each file encountered. The remaining arguments should be directory and file names, which File::Find will read in sequence until it gets to the end.
For example, Listing 4 is a short program that uses File::Find to print all of the file names in a particular directory. As you can see, File::Find exports the variable $File::Find::name which contains the current file name as well as the find subroutine. The current directory is stored in $File::Find::dir.
Listing 5 contains a version of simple-find-2.pl, which uses File::Find to search through all of the files under a given directory tree. As with many programs that use File::Find, the bulk of simple-find-2.pl is spent inside of find_matches, a subroutine called once for every file encountered under the directories passed in @ARGV. To find all files containing the pattern “f.[aeiou]” in directories under /home and /development, type:
./simple-find-2.pl "f.[aeiou]" /home /development
Line 11 of simple-find-2.pl is particularly important, in that it undefines $/, the variable that determines the end-of-line character. Normally, Perl's <> operator iterates through a file line by line, returning undef when the end is reached. However, we want to search across an entire file, since a pattern might have to extend across lines. By undefining $/, the line
my $contents = (<FILE>);puts the entire contents of the file handle FILE inside of $contents, rather than just one line.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.
Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.
Sponsored by ActiveState
| Speed Up Your Web Site with Varnish | Jun 19, 2013 |
| Non-Linux FOSS: libnotify, OS X Style | Jun 18, 2013 |
| Containers—Not Virtual Machines—Are the Future Cloud | Jun 17, 2013 |
| Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer | Jun 12, 2013 |
| Weechat, Irssi's Little Brother | Jun 11, 2013 |
| One Tail Just Isn't Enough | Jun 07, 2013 |
- Speed Up Your Web Site with Varnish
- Containers—Not Virtual Machines—Are the Future Cloud
- Linux Systems Administrator
- Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer
- Non-Linux FOSS: libnotify, OS X Style
- Senior Perl Developer
- Technical Support Rep
- UX Designer
- Web & UI Developer (JavaScript & j Query)
- RSS Feeds
Featured Jobs
| Linux Systems Administrator | Houston and Austin, Texas | Host Gator |
| Senior Perl Developer | Austin, Texas | Host Gator |
| Technical Support Rep | Houston and Austin, Texas | Host Gator |
| UX Designer | Austin, Texas | Host Gator |
| Web & UI Developer (JavaScript & j Query) | Austin, Texas | Host Gator |
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?




1 hour 29 min ago
5 hours 29 min ago
6 hours 45 min ago
10 hours 17 min ago
13 hours 10 min ago
13 hours 36 min ago
16 hours 4 min ago
16 hours 38 min ago
16 hours 39 min ago
16 hours 40 min ago