Cross-Platform CD Index

CD-ROM content needs a search engine that can run in any browser, straight from static index files. JavaScript and XML make it possible.
Short Tutorial

End users need not worry about any of this. They simply type words to search for on a Web page, and jsFind returns links to pages containing those keywords. No install, no worries, just a seamless experience.

Figure 2. Example Search Results

As a developer of content, however, your life is not so simple. The jsFind toolset tries to make your job as easy as possible, though. To start, you need Perl and a fair amount of CPU time to generate the index. Most likely you also need a copy of all the target browsers so you can test the results. An example with a Makefile can be found in the jsFind distribution, but several steps need to be tailored to your individual needs.

The first step is to get a data set consisting of keywords and links; the input format needs to be XML. I used SWISH-E with a custom patch to extract and create an index and then exported the results to the XML format suitable for processing with jsFind's Perl scripts. Assuming the SWISH-E index is in the file mystuff.index, the following command exports the file to XML:

$ swish-e -f mystuff.index -T INDEX_XML > mystuff.xml

The structure of this XML file is as follows:


<index>
 <word>
  <name>akeywordhere</name>
  <path freq="11" title="Something neat">
    /cdrom/blah.html
  </path>
  <path freq="10" title="More cool stuff">
    /cdrom/blah2.html
  </path>
 </word>
 <word>
 ...
 </word>
</index>

The XML file is sorted by keyword name.
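An export in this shape can be read with any XML parser. The following Python sketch is illustrative only (it is not part of the jsFind toolset) and turns the structure above into a keyword-to-paths map:

```python
# Minimal sketch: parse a jsFind-style XML index export using only
# the Python standard library. Element names follow the structure
# shown above; the SAMPLE document is a made-up example.
import xml.etree.ElementTree as ET

SAMPLE = """<index>
  <word>
    <name>akeywordhere</name>
    <path freq="11" title="Something neat">/cdrom/blah.html</path>
    <path freq="10" title="More cool stuff">/cdrom/blah2.html</path>
  </word>
</index>"""

def load_index(xml_text):
    """Return {keyword: [(path, freq, title), ...]}."""
    root = ET.fromstring(xml_text)
    index = {}
    for word in root.findall("word"):
        name = word.findtext("name").strip()
        index[name] = [
            (p.text.strip(), int(p.get("freq")), p.get("title"))
            for p in word.findall("path")
        ]
    return index

print(load_index(SAMPLE))
```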

The resulting data set is probably still too large, because SWISH-E doesn't concern itself with filtering out words such as “and”, “this” and other common English words. Two Perl programs, occurrences.pl and filter.pl, can be used to filter the result. occurrences.pl creates a list of keywords and counts the number of times each occurs in an index:

$ occurrences.pl mystuff.xml | sort -n -k 2 \
  > mystuff.keys

This file has a keyword on each line followed by the number of occurrences:

$ tail mystuff.keys
you 134910
for 138811
i 149471
in 168657
is 179815
of 252424
and 273283
a 299319
to 349069
the 646262

At this point comes the mind-numbing task of creating a keyword exclusion file. Edit the keys file and leave in all the words that should be excluded from the final index. Better still, rather than creating your own file, get a copy of the 300 most common words in English from ZingMan at www.zingman.com/commonWords.html.
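If you would rather generate the exclusion file than edit it by hand, a short script can pull out every word above some occurrence threshold. This Python sketch is illustrative only; the threshold value and function name are arbitrary assumptions, not part of jsFind:

```python
# Illustrative sketch (not part of the jsFind toolset): build an
# exclusion list from an occurrences file by keeping only the very
# common words. The threshold is an arbitrary assumption; tune it
# per data set so the final index fits your space budget.
def build_exclusions(lines, threshold=100000):
    """lines: 'keyword count' pairs; return keywords at/above threshold."""
    excluded = []
    for line in lines:
        word, count = line.split()
        if int(count) >= threshold:
            excluded.append(word)
    return excluded

keys = ["you 134910", "linux 5000", "the 646262"]
print(build_exclusions(keys))   # → ['you', 'the']
```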

Next, run the filter. The Perl script filter.pl, included in the package, filters a result set. It currently is set to exclude any single-character index key (except the letter C), any key that starts with two numeric digits (so things like 3com and 0xe3 are okay) and anything in the specified exclusion file:

$ filter.pl mystuff.xml mystuff.keys > \
  mystuff-filtered.xml

This step takes quite a bit of time. Make sure the final size of the file falls within the limits of the space available. The final index should be about 75% of the size of the filtered index. If it's too big, whittle it down to size with a longer keyword exclusion file.
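The exclusion rules are simple enough to express compactly. This Python sketch is my reading of the rules described above, not filter.pl's actual code:

```python
# Sketch of the filtering rules described in the text (an
# interpretation, not filter.pl itself): drop single-character keys
# except "c", keys starting with two digits, and excluded words.
import re

def keep_key(key, exclusions):
    if len(key) == 1 and key.lower() != "c":
        return False                      # single character, not C
    if re.match(r"^\d\d", key):
        return False                      # starts with two digits
    if key in exclusions:
        return False                      # in the exclusion file
    return True

exclusions = {"the", "and"}
print([k for k in ["c", "x", "3com", "0xe3", "42nd", "the", "linux"]
       if keep_key(k, exclusions)])      # → ['c', '3com', '0xe3', 'linux']
```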

The second big step is creating the index itself. A script is provided to break this index down into a set of B-tree XML files:

$ mkindex.pl mystuff-filtered.xml 25
blocksize: 20
keycount: 101958
depth: 4
blockcount: 5098
maximum keys: 194480
fill ratio: 0.524259563965446
bottom fill: 92698
Working: 11%

Parameters are the next thing to consider. The blockcount states how many B-tree blocks need to be created. Each block produces one key-nodes file, one data-nodes file and one directory. If the total number of files and directories is too high, increase the blocksize until it fits. The depth is the number of levels in the tree; if the blocksize gets too large, search times slow down, so bottom fill is how the tree is kept balanced. Once that number of keys has been put in the bottom row, the bottom row is closed to further node creation, yielding a balanced tree.
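The reported numbers all follow from the keycount and blocksize alone. This Python sketch reproduces the sample run above, assuming each block holds blocksize keys and has blocksize+1 children; that model is inferred from the output, not taken from mkindex.pl's source:

```python
# Sketch of where mkindex.pl's reported numbers come from, assuming
# a B-tree with `blocksize` keys and blocksize+1 children per block
# (an inference from the sample output, not from the script itself).
import math

def btree_stats(keycount, blocksize):
    m = blocksize + 1                         # children per block
    depth = 1
    while blocksize * (m**depth - 1) // (m - 1) < keycount:
        depth += 1                            # grow until keys fit
    max_keys = blocksize * (m**depth - 1) // (m - 1)
    interior_keys = blocksize * (m**(depth - 1) - 1) // (m - 1)
    bottom_fill = keycount - interior_keys    # keys left for bottom row
    blockcount = (interior_keys // blocksize
                  + math.ceil(bottom_fill / blocksize))
    return depth, max_keys, bottom_fill, blockcount, keycount / max_keys

# Reproduces the sample run: depth 4, maximum keys 194480,
# bottom fill 92698, blockcount 5098, fill ratio ~0.524
print(btree_stats(101958, 20))
```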

If all works well, you should end up with three files in the current directory: 0.xml, _0.xml and the directory 0. These are the index files. The next step is to follow the provided example for integrating the results into your HTML/JavaScript. The results then are passed to a provided routine and need to be posted back to the current Web page. The example does this using JavaScript to create dynamic HTML.

______________________

Comments


mkindex.pl ?


Where did you get mkindex.pl and filter.pl from?

The jsFind page (elucidsoft) is not reachable, and Dobrica Pavlinusic's (Perl) jsFind neither needs nor provides them.


Just when I thought I'd failed....


If, like me, you get this error when trying to run the binary after installation:

swish-e: error while loading shared libraries: libswish-e.so.2: cannot open shared object file: No such file or directory

follow these directions:

add /usr/local/lib to /etc/ld.so.conf
# ldconfig
# swish-e -V

ref: http://swish-e.org/archive/2003-11/6385.html

:)

patching


Having a go at this, and it seems to be going OK. This is going to be sweet, as I've been moving from Windows to Linux for a while now, and this is one of the last steps in discarding Windows, due to not using it :)

Anyway, about patching SWISH-E with the jsFind patch.

It's a bit tricky, as the links to the patches on the jsFind page are broken, but one version of the patch file comes with the main download, in the "download" directory...

It doesn't apply cleanly, though; in my case the closest version match I could get was SWISH-E 2.4.0 and the 2.4.0.2 patch...

If you use the patch command, two of the four changes will work, but you'll have to make the other two by hand in a text editor, or something a bit spiffy like Bluefish :)

Just read the patch file and make the changes it outlines yourself; it's not that hard once you understand how a patch file is written and "works".

Anyway, my latest attempt at SWISH-E compilation/installation is done and I'm back at the prompt, so I'm out...

Re: Cross-Platform CD Index


Wow, an article posted from the future!

Re: Cross-Platform CD Index


Here's the Linux Journal archive CD which has this software on it.
