Cross-Platform CD Index
End users need not worry about any of this. They simply can type words to search for on a Web page, and jsFind returns links to pages containing those keywords. No install, no worries, just a seamless experience.
As a developer of content, however, your life is not so simple. The jsFind toolset tries to make your job as easy as possible, though. To start, you need Perl and a fair amount of CPU time to generate the index. Most likely you also need a copy of all the target browsers so you can test the results. An example with a Makefile can be found in the jsFind distribution, but several steps need to be tailored to your individual needs.
The first step is to get a data set consisting of keywords and links; the input format needs to be XML. I used SWISH-E with a custom patch to extract and create an index and then exported the results to the XML format suitable for processing with jsFind's Perl scripts. Assuming the SWISH-E index is in the file mystuff.index, the following command exports the file to XML:
$ swish-e -f mystuff.index -T INDEX_XML > mystuff.xml
The structure of this XML file is as follows:
<index> <word> <name>akeywordhere</name> <path freq="11" title="Something neat"> /cdrom/blah.html </path> <path freq="10" title="More cool stuff"> /cdrom/blah2.html </path> </word> <word> ... </index>
The XML file is sorted by order of keyword name.
The resulting data set still is probably too large, because SWISH-E doesn't concern itself with filtering out words like “and”, “this” and other common English words. Two Perl programs can be used to filter the result, occurrences.pl and filter.pl. occurrences.pl creates a list of keywords and determines the number of times they occur in an index:
$ occurrences.pl mystuff.xml | sort -n -k 2 \ > mystuff.keys
This file has a keyword on each line followed by the number of occurrences:
$ tail mystuff.keys you 134910 for 138811 i 149471 in 168657 is 179815 of 252424 and 273283 a 299319 to 349069 the 646262
At this point, the mind-numbing task of creating a keyword exclusion file is performed. Edit the key file and leave in all the words that should be excluded from the final index. Even better than creating your own file, get a copy of the 300 most common words in English from ZingMan at www.zingman.com/commonWords.html.
Next, run the filter. The Perl script filter.pl included in this package filters a result set. It currently is set to exclude any single-character index keys (except the letter C), any key that starts with two numeric digits (so things like 3com and 0xe3 are okay) and anything in the specified exclusion file:
$ filter.pl mystuff.xml mystuff.keys > \ mystuff-filtered.xml
This step takes quite a bit of time. Make sure the final size of the file falls within the limits of the space available. The final index should be about 75% of the size of the filtered index. If it's too big, whittle it down to size with a longer keyword exclusion file.
The second big step is creating the index itself. A script is provided to break this index down into a set of B-tree XML files:
$ mkindex.pl mystuff-filtered.xml 25 blocksize: 20 keycount: 101958 depth: 4 blockcount: 5098 maximum keys: 194480 fill ratio: 0.524259563965446 bottom fill: 92698 Working: 11%
Parameters are the next thing to consider. The blockcount states how many B-tree blocks need to be created. Each block creates one key nodes file and one data nodes file, and one directory. If the total number of files and directories is too high, increase the blocksize until it fits. The depth shows the number of levels in the tree. If the blocksize gets too large, search times slow down, so bottom fill is how it is kept balanced. Once that number of keys is put in the bottom row, the bottom row is closed to further node creation, thus creating a balanced tree.
- Readers' Choice Awards 2014 Poll
- Give new life to old phones and tablets with these tips!
- Tech Tip: Really Simple HTTP Server with Python
- Linux Systems Administrator
- Senior Perl Developer
- Technical Support Rep
- Memory Ordering in Modern Microprocessors, Part I
- Source Code Scanners for Better Code
- RSS Feeds