A Simple Search Engine
The CGI (“common gateway interface”) standard was originally designed to allow users to run programs via the Web, which would otherwise be available only on the server. Thus, the first CGI programs were simple interfaces to grep and finger, which received their inputs from an HTML form and sent the HTML-formatted output to the user's browser.
CGI programs, and server-side programs in general, have become more sophisticated since then. However, one application is as useful now as it was in the past: the ability to search through a web site for documents containing a particular word or string.
While search sites (now called “portals”) make it possible to browse through a large collection of pages spread out over a number of servers, the CGI programs handling the search have an easier job. They have to go through files only on the local server, producing a list of URLs matching the user's request.
This month, we will look at how to implement several different types of search programs. While these programs might not compete successfully with ht://Dig and Webglimpse, they do offer some insight into how these sorts of programs work, and the trade-offs programmers must make when writing such software.
Perl has long been my favorite language for writing server-side programs. This is in no small part due to its strong text-handling capabilities. Perl comes with a rich regular-expression language that makes it easy to find one piece of text inside another.
For example, the following one-line program prints any line of test.txt containing the word “foo”:
perl -ne 'print if m/foo/' test.txt
The -n switch tells Perl not to print lines by default, and the -e switch allows us to insert a program between the single quotes ('). We instruct Perl to print any line in which the m// (match) operator finds the search string. We can accomplish the same thing inside of a program, as shown in Listing 1.
Of course, the above program searches for a single pattern (the string “foo”) inside of a single file (test.txt). We can generalize the program more by using an empty <>, rather than iterating over <FILE>. An empty <> iterates through each element of @ARGV (the array containing command-line arguments), assigning each one in turn to the scalar $ARGV. If there are no command-line arguments, then <> expects to receive input from the user. Listing 2 is a revised version of the above program, which searches through multiple files for the string “foo”. Notice how this version of the program prints the file name as well as the matching line. Since $_ already contains a newline character, we need not put one at the end of the print statement. Listing 2 could be rewritten in a single line of Perl with the following:
perl -ne 'print "$ARGV: $_" if m/foo/;' *
Finally, we can make our simple search a bit more sophisticated by allowing the user to name the pattern, as well as the files. Listing 3 takes the first command-line argument, removing it from @ARGV and putting it in $pattern. To tell Perl that $pattern will not change, and that it should compile the search pattern only a single time, we use m// with the /o option.
Thus, to search for the pattern f.[aeiou] in all of the files with a “txt” extension, we would use:
./simple-search-3.pl "f.[aeiou]" *.txt
Sure enough, every line containing an f, followed by any character, followed by a vowel is printed on the screen, preceded by a file name.
The above would be a good skeleton for our web-based search if all documents on a web site were stored in a single directory. However, the opposite is normally the case: most web sites put files in a number of different directories. A good search program must traverse the entire web hierarchy, searching through each file in each directory.
While we could certainly accomplish this ourselves, someone has already done it for us. File::Find, a package which comes with Perl, makes it possible to create a find-like program using Perl. File::Find exports the find subroutine, which takes a list of arguments. The first argument is a subroutine reference invoked once for each file encountered. The remaining arguments should be directory and file names, which File::Find will read in sequence until it gets to the end.
For example, Listing 4 is a short program that uses File::Find to print all of the file names in a particular directory. As you can see, File::Find exports the variable $File::Find::name which contains the current file name as well as the find subroutine. The current directory is stored in $File::Find::dir.
Listing 5 contains a version of simple-find-2.pl, which uses File::Find to search through all of the files under a given directory tree. As with many programs that use File::Find, the bulk of simple-find-2.pl is spent inside of find_matches, a subroutine called once for every file encountered under the directories passed in @ARGV. To find all files containing the pattern “f.[aeiou]” in directories under /home and /development, type:
./simple-find-2.pl "f.[aeiou]" /home /development
Line 11 of simple-find-2.pl is particularly important, in that it undefines $/, the variable that determines the end-of-line character. Normally, Perl's <> operator iterates through a file line by line, returning undef when the end is reached. However, we want to search across an entire file, since a pattern might have to extend across lines. By undefining $/, the line
my $contents = (<FILE>);puts the entire contents of the file handle FILE inside of $contents, rather than just one line.
Fast/Flexible Linux OS Recovery
On Demand Now
In this live one-hour webinar, learn how to enhance your existing backup strategies for complete disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible full-system recovery solution for UNIX and Linux systems.
Join Linux Journal's Shawn Powers and David Huffman, President/CEO, Storix, Inc.
Free to Linux Journal readers.Register Now!
- Download "Linux Management with Red Hat Satellite: Measuring Business Impact and ROI"
- ServersCheck's Thermal Imaging Camera Sensor
- The Italian Army Switches to LibreOffice
- Linux Mint 18
- Petros Koutoupis' RapidDisk
- Oracle vs. Google: Round 2
- The FBI and the Mozilla Foundation Lock Horns over Known Security Hole
- Privacy and the New Math
- Ben Rady's Serverless Single Page Apps (The Pragmatic Programmers)
Until recently, IBM’s Power Platform was looked upon as being the system that hosted IBM’s flavor of UNIX and proprietary operating system called IBM i. These servers often are found in medium-size businesses running ERP, CRM and financials for on-premise customers. By enabling the Power platform to run the Linux OS, IBM now has positioned Power to be the platform of choice for those already running Linux that are facing scalability issues, especially customers looking at analytics, big data or cloud computing.
￼Running Linux on IBM’s Power hardware offers some obvious benefits, including improved processing speed and memory bandwidth, inherent security, and simpler deployment and management. But if you look beyond the impressive architecture, you’ll also find an open ecosystem that has given rise to a strong, innovative community, as well as an inventory of system and network management applications that really help leverage the benefits offered by running Linux on Power.Get the Guide