How to Index Anything

You probably have search on your web site, but how about a search engine for the man pages on your system or even your mail? Try this simple indexing package.
Indexing PDF Files

Up to now, we've talked only about indexing HTML, XML and text files. Here's a more-advanced example: indexing PDF documents from the Linux Documentation Project.

For SWISH-E to index arbitrary files, PDF or otherwise, we must convert the files to text, ideally resembling HTML or XML, and arrange to have SWISH-E index the results.

We could index the PDF files by converting each to a corresponding file on disk and then index those, but instead we'll use this opportunity to introduce a more flexible way to index data: SWISH-E's programmatic access method (Figure 2).

Figure 2. Indexing Arbitrary Data with an External Program and SWISH-E

To index the PDF files, start by creating a SWISH-E configuration file, calling it howto-pdf.conf and endowing it with the following contents:

# howto-pdf.conf
IndexDir      ./howto-pdf-prog.pl
               # prog file to hand us XML docs
IndexFile     ./howto-pdf.index
               # Index to create.
UseStemming   yes
MetaNames     swishtitle swishdocpath

Here, the IndexDir directive names what SWISH-E calls an external program: a program that returns data about what is to be indexed, rather than a directory containing the files. The UseStemming yes directive asks SWISH-E to reduce words to their root forms before indexing and searching. Without stemming, a search for “runs” does not match a document that contains only “running”. With stemming, SWISH-E recognizes that “runs” and “running” share the same root, or stem, and finds the document relevant.
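To make the idea of stemming concrete, here is a deliberately crude sketch. The toy_stem function is hypothetical and is not the algorithm SWISH-E uses; it strips only a few English suffixes, just enough to show why “runs” and “running” end up as the same index term.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy stemmer for illustration only; NOT the stemmer SWISH-E ships with.
sub toy_stem {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/(?:ing|es|s)$//;      # drop one common English suffix
    $word =~ s/([^aeiou])\1$/$1/;    # undouble a final consonant: "runn" -> "run"
    return $word;
}

# Compare a query word with a word found in a document.
for my $pair (['runs', 'running'], ['searches', 'searching']) {
    my ($query, $doc_word) = @$pair;
    printf "%-9s vs %-10s: %s\n", $query, $doc_word,
        toy_stem($query) eq toy_stem($doc_word) ? 'match' : 'no match';
}
```

Both pairs are reported as a match, because each word in a pair reduces to the same stem. A real stemmer handles far more cases than this sketch does.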

Last in our configuration file, but certainly not least, is the MetaNames directive. This line adds a special ability to our index—the ability to search on only the titles or filenames of the files.

Now, let's write the external program to return information about the PDF files we're indexing. Conveniently, the SWISH-E source ships with an example module, pdf2xml.pm, which uses the xpdf package to convert PDF to XML, prefixed with appropriate headers for SWISH-E. We use this module, copied to ~/indices, in our external program howto-pdf-prog.pl:

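#!/usr/bin/perl -w
use strict;
use lib '.';   # look for pdf2xml.pm in the current directory
use pdf2xml;
# collect the paths of all PDF files under ../HOWTO-pdfs/
my @files =
    `find ../HOWTO-pdfs/ -name '*.pdf' -print`;
for my $file (@files) {
    chomp $file;
    my $xml_record_ref = pdf2xml($file);
    # this is one XML file with a SWISH-E header
    print $$xml_record_ref;
}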

Equipped with the SWISH-E configuration file and the external program above, let's build the index:

% swish-e -c howto-pdf.conf -S prog

The -S prog option tells SWISH-E to consider the IndexDir specified as a program that returns information about the data to be indexed. If you forget to include -S prog when using an external program with SWISH-E, you'll be indexing the external program itself, not the documents it describes.
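It may help to see what an external program hands to SWISH-E. Judging from Listing 1 later in this article, each document is written to standard output as a few header lines (Path-Name, Document-Type, Content-Length), a blank line and then the document itself. The sketch below builds one such record. The swish_record name and the sample path are made up for illustration, and counting the length in bytes after UTF-8 encoding is an assumption about the input (a decoded character string), not something the article prescribes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: build one record in the shape Listing 1 prints,
# that is, headers, a blank line, then the document.
sub swish_record {
    my ($path, $xml) = @_;
    # Assumption: $xml is a decoded character string. Encoding it first
    # means Content-Length is counted in bytes, not characters.
    my $body = encode_utf8($xml);
    return "Path-Name: $path\n"
         . "Document-Type: XML*\n"
         . "Content-Length: " . length($body) . "\n\n"
         . $body;
}

print swish_record('/tmp/example.1',
                   "<all><desc>caf\x{e9}</desc></all>\n");
```

If the length header and the actual number of bytes disagree, SWISH-E reads the wrong amount of data and tries to parse document text as headers, which is consistent with the "Unknown header line" warnings that readers report in the comments.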

When the PDF index is built, we can perform searches:

% swish-e -f howto-pdf.index -m 2 -w boot disk

We should get results similar to:

1000 ../HOWTO-pdfs/Bootdisk-HOWTO.pdf
      "Bootdisk-HOWTO.pdf" 127194
983 ../HOWTO-pdfs/Large-Disk-HOWTO.pdf
     "Large-Disk-HOWTO.pdf" 85280

The MetaNames directive also lets us search on the titles and paths of the PDF files:

% swish-e -f howto-pdf.index -w swishtitle=apache
% swish-e -f howto-pdf.index -w swishdocpath=linux

These searches can be combined freely with Boolean operators and parentheses. For example:

% swish-e -f howto-pdf.index -w '(larry and wall)
    OR (swishdocpath=linux OR swishtitle=kernel)'

The quoting above is necessary to protect the parentheses from interpretation by the shell.
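When the query is issued from a Perl script rather than typed at a prompt, one way to sidestep shell quoting altogether is the list form of system, which passes each element to the program as a separate argument without involving a shell. This is a sketch only: swish_cmd is a hypothetical helper, and the options it uses (-f, -m, -w) are the ones shown in this article.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the argument list for a swish-e search.
# Each list element reaches swish-e as one argument, so the shell
# never gets a chance to interpret the parentheses.
sub swish_cmd {
    my ($index, $query, $max) = @_;
    my @cmd = ('swish-e', '-f', $index);
    push @cmd, '-m', $max if defined $max;   # optional hit limit
    push @cmd, '-w', $query;
    return @cmd;
}

my @cmd = swish_cmd('howto-pdf.index',
    '(larry and wall) OR (swishdocpath=linux OR swishtitle=kernel)');
print join(' | ', @cmd), "\n";   # show the separate arguments

# To run the search for real (requires swish-e on the PATH):
# system(@cmd) == 0 or die "swish-e failed: $?";
```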

Indexing Man Pages

For our final example, we show how to make a useful and powerful index of man pages and how to use the SWISH::API Perl module to write a searching client for the index. Again, first write the configuration file:

# sman-index.conf
IndexFile ./sman.index
  # Index to create.
IndexDir  ./sman-index-prog.pl
IndexComments no
  # don't index text in comments
UseStemming yes
MetaNames     swishtitle desc sec
PropertyNames            desc sec

We've described most of these directives already, but we're defining some new MetaNames and introducing something called PropertyNames.

In a nutshell, MetaNames are what SWISH-E actually searches on. The default MetaName is swishdefault, and that's what is searched on when no MetaName is specified in a query. PropertyNames are fields that can be returned describing hits.

SWISH-E results normally are returned with several Auto Properties including swishtitle, swishdescription, swishrank and swishdocpath. The MetaNames directive in our configuration specifies that we want to be able to search independently not only on each whole document, but also on only the title, the description or the section. The PropertyNames line specifies that we want the sec and desc properties, the man page's section and short description, to be returned separately with each hit.

The work of converting the man pages to XML and wrapping it in headers for SWISH-E is performed in Listing 1 (sman-index-prog.pl).

Listing 1. sman-index-prog.pl converts man pages to XML for indexing.

#!/usr/bin/perl -w

use strict;
use File::Find;

my ($cnt, @files) = (0, get_man_files());
warn scalar @files, " man pages to index...\n";
for my $f (@files) {
    warn "processing $cnt\n" unless ++$cnt % 20;
    my ($hashref) = parse_man($f);
    my $xml = make_xml($hashref);
    my $size = length $xml; # NOTE: Fails if UTF
    print "Path-Name: $f\n",
       "Document-Type: XML*\n",
       "Content-Length: $size\n\n", $xml;
}

sub get_man_files {  # get english manfiles
     my @files;
     chomp(my $man_path = $ENV{MANPATH} ||
       `manpath` || '/usr/share/man');
     find( sub {
       my $n = $File::Find::name;
       push @files, $n
       if -f $n && $n =~ m!man/man.*\.!
    }, split /:/, $man_path );
    return @files;
}
sub make_xml { # output xml version of hash
    my ($metas) = @_; # escapes vals as side-effect
    my $xml = join ("\n",
        map { "<$_>" . escape($metas->{$_}) . "</$_>" }
        keys %$metas);
    my $pre = qq{<?xml version="1.0"?>\n};
    return qq{$pre<all>$xml</all>\n};
}
sub escape { # modifies scalar you pass!
    return "" unless defined($_[0]);
    s/&/&amp;/g, s/</&lt;/g, s/>/&gt;/g for $_[0];
    return $_[0];
}

sub parse_man {   # this is the bulk
    my ($file) = @_;
    my ($manpage, $cur_content) = ('', '');
    my ($cur_section,%h) = qw(NOSECTION);
    open FH, "man $file  | col -b |"
    or die "Failed to run man: $!";
    my ($line1, $lineM) = (scalar(<FH>) || "", "");
    while ( <FH> ) {  # parse manpage into sections
       $line1 = $_ if $line1 =~ /^\s*$/;
       $manpage .= $lineM = $_ unless /^\s*$/;
       if (s/^(\w(\s|\w)+)// || s/^\s*(NAME)/$1/i){
          chomp( my $sec = $1 );  # section title
          $h{$cur_section} .= $cur_content;
          $cur_content = "";
          $cur_section = $sec; # new section name
       }
       $cur_content .= $_ unless /^\s*$/;
    }
    $h{$cur_section} .= $cur_content;

    # examine NAME, HEADer, FOOTer, (and
    # maybe the filename too).
    close(FH) or die "Failed close on pipe to man";
    @h{qw(A_AHEAD A_BFOOT)} = ($line1, $lineM);
    my ($mn, $ms, $md) = ("", "", "");
    # NAME mn, DESCRIPTION md, & SECTION ms
    for(sort keys(%h)) { # A_AHEAD & A_BFOOT first
       my ($k, $v) = ($_, $h{$_}); # copy key&val
       if (/^A_(AHEAD|BFOOT)$/) { #get sec or cmd
           # look for the 'section' in ()'s
          if ($v =~ /\(([^)]+)\)\s*$/) {$ms||= $1;}
       } elsif($k =~ s/^\s*(NOSECTION|NAME)\s*//) {
          my $namestr = $v || $k; # 'cmd - a desc'
          if ($namestr =~ /(\S.*)\s+--?\s*(.*)/) {
             $mn ||= $1 || "";
             $md ||= $2 || "";
          } else { # that regex could fail.
             $md ||= $namestr || $v;
          }
       }
    }
    if (!$ms && $file =~ m!/man/man([^/]*)/!) {
       $ms = $1; # get sec from path if not found
    }
    ($mn = $file) =~ s!(^.*/)|(\.gz$)!!g unless $mn;
    my %metas;
    @metas{qw(swishtitle sec desc page)} =
       ($mn, $ms, $md, $manpage);
    return ( \%metas ); # return ref to 4-key hash.
}

The first for loop in Listing 1 is the main loop of the program. It looks at each man page, parses it as needed, converts it to XML and wraps it in the appropriate headers for SWISH-E:

  • get_man_files() uses File::Find to traverse the man directories to find man page source files.

  • make_xml() and escape() together create XML from the hashref returned by parse_man().

  • parse_man() performs the nitty-gritty work of getting the relevant fields from the man page source.
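The escape() routine in Listing 1 modifies the scalar it is given, as its comment warns. For readers who would rather avoid that side effect, here is a small variant of the same two steps. The names xml_escape and make_xml_sorted are invented for this sketch, and sorting the keys is an addition that only makes the output order predictable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return an escaped copy; the caller's value is left untouched.
sub xml_escape {
    my ($text) = @_;
    return '' unless defined $text;
    $text =~ s/&/&amp;/g;   # must come first, or later entities get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Same shape as make_xml() in Listing 1, with keys sorted for stable output.
sub make_xml_sorted {
    my ($metas) = @_;
    my $xml = join "\n",
        map { "<$_>" . xml_escape($metas->{$_}) . "</$_>" }
        sort keys %$metas;
    return qq{<?xml version="1.0"?>\n<all>$xml</all>\n};
}

my %metas = (
    swishtitle => 'ls',
    sec        => '1',
    desc       => 'list directory contents <sorted>',
);
print make_xml_sorted(\%metas);
```

Each key of the hash becomes an XML element, which is what lets the MetaNames and PropertyNames directives refer to swishtitle, sec and desc by name.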

Now that we've explained it, let's use it:

% swish-e -c sman-index.conf -S prog

When that's done, you can test the index as before, using swish-e's -w option.

As our final example, we discuss a Perl script that uses SWISH::API to query the index we just built, providing an improved version of the UNIX standby apropos. The code is included in Listing 2 (sman). Here's a brief rundown: lines 1-14 set things up and parse command-line options, lines 15-23 issue the query and do cursory error handling and lines 24-39 present the search results using Properties returned through the SWISH::API.

Listing 2. sman is a command-line utility to search man pages.

#!/usr/bin/perl -w

use strict;
use Getopt::Long qw(GetOptions);
use SWISH::API;

my ($max,$rankshow,$fileshow,$cnt) = (20,0,0,0);
my $index = "./sman.index";
GetOptions( "max=i"   => \$max,
             "index=s" => \$index,
             "rank"    => \$rankshow,
             "file"    => \$fileshow,
);
my $query = join(" ", @ARGV);
my $handle = SWISH::API->new($index);
my $results = $handle->Query( $query );
if ( $results->Hits() <= 0 ) {
    warn "No Results for '$query'.\n";
}
if ( my $error = $handle->Error( ) ) {
    warn "Error: ",  $handle->ErrorString(), "\n";
}
while ( ($cnt++ < $max) &&
(my $res = $results->NextResult)) {
    printf "%4d ", $res->Property( "swishrank" )
       if $rankshow;
    my $title = $res->Property( "swishtitle" );
    if (my $cmd = $res->Property( "cmd" )) {
       $title .= " [$cmd]";
    }
    printf "%-25s (%s) %-30s", $title,
       $res->Property( "sec" ),
       $res->Property( "desc" );
    printf " %s", $res->Property( "swishdocpath"
)
       if $fileshow;
    print "\n";
}

The Perl client is that simple. Let's use ours to issue searches on our man pages such as:

% ./sman -m 1 boot disk

We should get back:

bootparam (7) Introduction to boot time para...

But we now also can do searches like:

% ./sman sec=3 perl

to limit searches to section 3. The sman program also accepts the command-line option --max=# to specify the maximum number of hits returned, --file to show the source file of the man page and --rank to show each hit's rank for the given query:

% ./sman --max=1 --file --rank boot

This returns:

1000 lilo.conf (5) configuration file for lilo
    /usr/man/man5/lilo.conf.5

Notice the rank as the first column and the source file as the last one.
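The sman examples above use both -m 1 and --max=1, although Listing 2 declares only a max option. That works because Getopt::Long, by default, accepts unambiguous abbreviations of option names. The sketch below isolates just the option parsing so it can be tried without an index; parse_sman_args is a hypothetical wrapper, and it uses GetOptionsFromArray so the argument list can be supplied from code instead of @ARGV.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical wrapper: parse an sman-style argument list and return
# the options plus whatever is left over, joined into the query string.
sub parse_sman_args {
    my @args = @_;
    my %opt = (max => 20, rank => 0, file => 0);   # same defaults as Listing 2
    GetOptionsFromArray(\@args,
        'max=i' => \$opt{max},
        'rank'  => \$opt{rank},
        'file'  => \$opt{file},
    ) or die "bad options\n";
    return (\%opt, join(' ', @args));
}

for my $argv ([qw(-m 1 boot disk)],
              [qw(--max=1 --file --rank boot)],
              [qw(sec=3 perl)]) {
    my ($opt, $query) = parse_sman_args(@$argv);
    print "max=$opt->{max} rank=$opt->{rank} file=$opt->{file} ",
          "query='$query'\n";
}
```

Arguments such as sec=3 do not begin with a dash, so they are left alone and become part of the query that SWISH-E interprets.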

An enhanced version of the sman package will be available at joshr.com/src/sman/.

______________________

Comments

Re: How to Index Anything

Mengucino's picture

In this page you will find the way to do this. Also it has a lot of tips for webmaster and search engines.

Bye

Megucino from
______

Re: How to Index Anything

Anonymous's picture

Does anyone know how to index a dynamic page, such as a PHP/JSP page?

Re: How to Index Anything

Vuelos's picture

For PHP you don't have to do anything special. The server returns HTML to the bot, not PHP or JSP code. The problem for a robot is knowing whether a page such as /pages.php?id=23 is different from /pages.php?id=24. The robot can't index every page with different parameters, so it must implement an algorithm that determines whether pages are similar or identical, in which case the duplicate shouldn't be indexed.

Re: How to Index Anything

augur's picture

Does anyone know how to index a dynamic page, such as a PHP/JSP page?

If you spider the site (i.e. -S http) then you don't need to do anything special as long as the PHP/JSP code results in text, HTML or XML.
If you use the FS (filesystem) method then, at least for PHP, you can have SWISH-E use the PHP CGI executable to process each document into text, HTML or XML. In your index configuration add something like this:

IndexContents HTML* .php
FileFilter .php /usr/bin/php "-q '%p'"

Can Swish deal separately with META?

Anonymous's picture

Can Swish deal separately with the META element? It would be very useful to be able to search arbitrary metadata such as authors, keywords or abstracts.

Re: Can Swish deal separately with META?

Anonymous's picture

yes, SWISH-E will automatically parse META tags in HTML/XML docs,
as per the current SWISH-E 2.4.0 documentation here.

SMAN Project RELEASED: search on man pages

Anonymous's picture

Hello All,
The SMAN project is now publicly available from
http://joshr.com/src/sman.

SMAN is an enhanced version of the unix standbys 'man -k' and 'apropos,' as discussed in Josh Rabinowitz's "How To Index Anything" article in the July 2003 issue of Linux Journal.

Please test it out and let Josh know what you think!

From the SMAN README:

Sman is the Searcher for Man pages. Based on the example of the
same name in Josh Rabinowitz's article "How To Index Anything"
in the July, 2003 issue of Linux Journal
(http://www.linuxjournal.com/article.php?sid=6652), sman is
an enhanced version of 'apropos' and 'man -k'. Sman adds
several key abilities over its predecessors:

* Supports complex natural language text searches such as
"(linux and kernel) or (mach and microkernel)"

* Shows results in a ranked order

* Allows for searches by manpage section, title,
body, or filename

* Uses a prebuilt index to perform fast searches

* Performs 'stemming' so that a search for "searches"
will match a document with the word "searching"

Again, SMAN is available from
http://joshr.com/src/sman.

Posted on Tuesday, July 01, 2003?

Anonymous's picture

Posted on Tuesday, July 01, 2003?

Re: Posted on Tuesday, July 01, 2003?

Anonymous's picture

I tried the man page index example and got errors when I entered

swish-e -c sman-index.conf -S prog

I got many warnings like this:

Warning: Unknown header line: ...

Here are the first few and the last couple:

$ swish-e -c sman-index.conf -S prog
Indexing Data Source: "External-Program"
Indexing "./sman-index-prog.pl"
10373 man pages to index...

Warning: Unknown header line: 'll>' from program ./sman-index-prog.pl
:385: warning [p 2, 9.8i]: can't break line
:391: warning [p 2, 10.8i]: can't break line
:399: warning [p 3, 0.8i]: can't break line

Warning: Unknown header line: 'ntains spaces.' from program ./sman-index-prog.pl

Warning: Unknown header line: 'Tcl 8.1 Tcl(n)' from program ./sman-index-prog.pl

Warning: Unknown header line: '' from program ./sman-index-prog.pl

[snip]

Warning: Unknown header line: '>' from program ./sman-index-prog.pl

Warning: Unknown header line: '>' from program ./sman-index-prog.pl
err: External program failed to return required headers Path-Name: & Content-Length:
.

Re: setenv LANG C to work around UTF issues

Anonymous's picture

I was able to get around this by setting the environment variable LANG to "C" like this (adjust for your shell):

setenv LANG C

I think this only needs to be done before indexing with sman-update, and not for sman itself.

Re: Posted on Tuesday, July 01, 2003?

Anonymous's picture

The author says the code was tested on RH6.2, RH7.3 and Debian Woody. Maybe you made a typo, or you have multibyte man pages on your system (which the article and code mention that SWISH-E will choke on)?

I just tried the sman example above and it worked for me on RH6.2:

% swish-e -c sman-index.conf -S prog
Indexing Data Source: "External-Program"
Indexing "./sman-index-prog.pl"
4803 man pages to index...
processing 20
....

There's an enhanced version of SMAN in development at http://joshr.com/src/sman. This version should work better, since it's not shortened to fit in an article.

Re: sman from joshr.com still gives errors

Anonymous's picture

# rpm -q libxml2
libxml2-2.5.4-1
# uname -a
Linux localhost 2.4.20-8 #1 Thu Mar 13 17:54:28 EST 2003 i686 i686 i386 GNU/Linux

# sman-update --verbose --warn --debug

[snip maybe valuable information ?]

**==== END XML of /usr/share/man/mann/Tcl.n.gz =========

** working on /usr/share/man/mann/after.n.gz
** Running man /usr/share/man/mann/after.n.gz...

Warning: Unknown header line: 'd, even if the vari-' from program stdin

Warning: Unknown header line: 'able

Re: sman from joshr.com still gives errors

Anonymous's picture

Some people report that setting the environment variable LANG=en_US might help this issue. I've also heard that a new release of sman is coming that will make it easier to pinpoint the source of errors like this. Let us know if this works!

Re: sman from joshr.com still gives errors

Anonymous's picture

There is a new release at http://joshr.com/src/sman .
Please let us know if this solves your problem.

Sman rocks! Works for me!

Anonymous's picture

I've been using sman for a while on my systems with no problems. It even works on OS X now. There's a new version at http://joshr.com/src/sman

Sman Rocks, and it's on CPAN and Freshmeat

Anonymous's picture

You can now find the latest versions of Sman on Freshmeat at
http://freshmeat.net/projects/sman/
and on CPAN at
http://search.cpan.org/~joshr/Sman/

Re: How to Index Anything

Anonymous's picture

This is cool. But how does Google or Yahoo read through any file type for content. I have done searches for linux and these sites have returned pdf, word, html, excel, powerpoint, text, and even an microsoft project file. How can these sites run such massive searches?

Re: How to Index Anything

Anonymous's picture

Google is a massive array of computers. That's why it is fast.
They have filters for those file types. As long as you have filters, you can do it too.

sure, swish-e is not google,

Anonymous's picture

sure, swish-e is not google, and never will be, but

it can also index MsWord, OpenOffice, PDF, RTF (apart from standard xml, html, txt) - PPT filter is also available by now (see swish-e.org).

and last, but not least - I run swish-e on Windows and Linux too, almost everything described in this good article is possible with Windows version of swish-e (yes, you don't have man pages there :)

cheers
