Cross-Platform CD Index

CD-ROM content needs a search engine that can run in any browser, straight from static index files. JavaScript and XML make it possible.

I recently was working on a CD-ROM catalog for a client, and he requested that it have keyword search ability. My searches for a solution kept turning up software for a proprietary OS that required an install on the user's machine and a license fee for every copy distributed. Such installation requirements are limiting, and the per-copy fees would add up over time. Furthermore, not all of the CD-ROM users were going to be using a single proprietary OS, so such a product automatically would shrink the potential customer base. While I sat back to think about the situation, a package in my mailbox caught my eye: the Linux Journal Archive CD. I figured if anybody had solved this problem, the solution was sure to be on the LJ Archive. Imagine my disappointment upon discovering that the LJ Archive CD has a really good index but no search engine. If a solution was to be found, I would have to find it myself. This article is about scratching that proverbial itch with jsFind.

Licensing

One of my earliest considerations was how to distribute and license my solution, jsFind. I showed early versions of it to colleagues, and they felt I should follow the model in which I license the code and then market it. jsFind then would be using the same model as other competing search engines for this type of content. Personally, I would rather spend my time coding than marketing, and I suspect the total market is not very large. I would rather get informative CD-ROMs and be able to search them easily using any browser and platform I choose.

The GNU General Public License (GPL) was more in line with my goals. Freely distributed, jsFind would be marketed on its own merits, gaining improvements and contributions as it grows. At the risk of preaching to the choir, one of the goals of proprietary systems is to lock users in by every possible means. For example, when a CD-ROM's search engine requires a specific browser and a specific OS, the user is forced to obtain a copy of that OS. CD-ROM producers also are forced to keep buying development tools for that OS in order to stay current. The result is that consumers and producers alike are locked in to the proprietary OS vendor. Releasing jsFind under the GPL would break the cycle.

How It Was Done

The jsFind keyword search engine itself is a small JavaScript program of about 500 lines. A browser that supports DOM Level 3 JavaScript extensions can load XML files. The current versions of Mozilla, Netscape and Microsoft Internet Explorer all support these extensions, and the upcoming release of Konqueror will do so as well. The index is stored as a set of XML files, and the JavaScript searches through these in an efficient manner to generate results of a keyword search. These results then can be posted back to the Web page that requested them, also using JavaScript.

A key assumption behind jsFind is that a CD-ROM holds a static set of information. Unlike the data behind a Web search engine or any other dynamic data set, a CD-ROM isn't going to change once it is pressed. SWISH-E is better suited for dynamic indexing, especially when one has the luxury of configuring a server to do the keyword searches. jsFind, by contrast, is based on the idea that the only things available are a standard Web browser with JavaScript and a set of browseable files, which is a severe restriction on possible solutions.

Most indexing algorithms try to strike a balance among insert, update, delete and select times. Because a CD-ROM is static, there never will be a delete or update. Inserts take place before the CD is burned, so they can afford to be quite time consuming. Select time is critical for user responsiveness. Space is an additional constraint, because a typical CD-ROM can't hold more than 700MB.

Re-examining indexing methods under these constraints yielded an interesting solution. B-trees and hashes are the two most commonly used indexing methods. I chose B-trees for two reasons. First, a filesystem organizes files in a tree, which could be used to store the structure of the B-tree, saving some precious space in the process. Second, because every key/link pair is known before the CD is burned, the pairs could be analyzed and a balanced B-tree created. The XML files themselves were kept as minimal as possible, so single-letter tags were used to save space.
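
As a rough illustration of building a balanced tree once all keys are known, the following JavaScript sketch bulk-builds a tree from an already sorted key list. It is not jsFind's actual indexer; the function names buildNode and height, the maxKeys parameter and the in-memory node shape are assumptions made for this example. The sketch keeps the tree shallow by spreading keys evenly across children, but it does not guarantee that every leaf sits at the same depth, as a textbook B-tree would.

```javascript
// Hypothetical sketch: bulk-build a shallow search tree from sorted keys.
// maxKeys (assumed parameter) caps the number of keys stored in one node.
function buildNode(sortedKeys, maxKeys) {
  if (maxKeys < 2) throw new Error("maxKeys must be at least 2");
  const n = sortedKeys.length;
  // Few enough keys: store them all in a single leaf node.
  if (n <= maxKeys) return { keys: sortedKeys.slice(), children: [] };

  // Use as many children as allowed while leaving each at least one key.
  const childCount = Math.min(maxKeys + 1, Math.floor((n + 1) / 2));
  const remaining = n - (childCount - 1); // keys left after taking separators
  const base = Math.floor(remaining / childCount);
  const extra = remaining % childCount;

  const keys = [];
  const children = [];
  let pos = 0;
  for (let i = 0; i < childCount; i++) {
    const size = base + (i < extra ? 1 : 0);
    children.push(buildNode(sortedKeys.slice(pos, pos + size), maxKeys));
    pos += size;
    // The key between two children becomes a separator in this node.
    if (i < childCount - 1) keys.push(sortedKeys[pos++]);
  }
  return { keys, children };
}

// Number of node levels from this node down to its deepest leaf.
function height(node) {
  return 1 + Math.max(0, ...node.children.map((child) => height(child)));
}

const tree = buildNode(["a", "b", "c", "d", "e", "f", "g"], 2);
console.log(tree.keys, height(tree));
```

Because this work happens before the CD is burned, the builder can afford to look at the whole key list at once, something an index that must accept inserts at any time cannot do.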

Description of B-trees

A B-tree is a data structure used frequently in database indexing and storage routines. It offers efficient search times, and storage and retrieval are done in blocks, which works well with current hardware. A B-tree consists of nodes (or blocks), each holding an ordered list of keys. Each key references an associated data set. If a requested key falls between two keys in the ordering, a reference is provided to another node of keys. A balanced B-tree is one in which the maximum number of nodes that could be loaded on a search stays at a minimum.
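
The lookup described above can be sketched in a few lines of JavaScript. This is a minimal in-memory illustration, not jsFind's code; the node shape (keys, data, children), the search function and the sample keys and file names are all invented for the example.

```javascript
// Hypothetical node shape: sorted keys, one data entry per key, and either
// no children (a leaf) or keys.length + 1 children, one per gap between keys.
const leaf = (keys, data) => ({ keys, data, children: [] });

const root = {
  keys: ["grape", "pear"],
  data: [["g.html"], ["p.html"]],
  children: [
    leaf(["apple", "cherry"], [["a.html"], ["c.html"]]),
    leaf(["kiwi", "mango"], [["k.html"], ["m.html"]]),
    leaf(["plum"], [["pl.html"]]),
  ],
};

// Walk from the root toward a leaf, counting how many nodes are visited.
function search(node, key) {
  let nodesLoaded = 0;
  while (node) {
    nodesLoaded++;
    // Find the first key that is not smaller than the requested key.
    let i = 0;
    while (i < node.keys.length && node.keys[i] < key) i++;
    if (node.keys[i] === key) return { data: node.data[i], nodesLoaded };
    // Not in this node: follow the reference for the gap the key falls in.
    node = node.children[i]; // undefined at a leaf, which ends the loop
  }
  return { data: null, nodesLoaded };
}

console.log(search(root, "kiwi"));
console.log(search(root, "zebra"));
```

The nodesLoaded count is the quantity a balanced tree keeps small: in this two-level sample, no search ever touches more than two nodes.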

jsFind creates a B-tree by using XML files for the nodes of the tree, and the directories on the filesystem correspond to references to another set of nodes. This allows for part of the structure of the B-tree to be encoded on the filesystem. If all the XML files are in the same directory, file open times might become long, so using the filesystem efficiently requires subdirectories.
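
One way to picture how directories can stand in for child references is the sketch below. The layout is hypothetical and is not taken from jsFind: each node is assumed to live in a file named node.xml, and the child for the i-th gap between keys is assumed to live in a subdirectory named i. A Map from path to an already parsed node stands in for the XML files on the CD, so the example runs without a browser.

```javascript
// Stand-in for files on the CD: path -> parsed node (hypothetical layout).
const cdFiles = new Map([
  ["index/node.xml", { keys: ["grape", "pear"], links: ["g.html", "p.html"] }],
  ["index/0/node.xml", { keys: ["apple", "cherry"], links: ["a.html", "c.html"] }],
  ["index/1/node.xml", { keys: ["kiwi", "mango"], links: ["k.html", "m.html"] }],
  ["index/2/node.xml", { keys: ["plum"], links: ["pl.html"] }],
]);

// Look up a key, reading one node file per directory level.
function lookup(files, rootDir, key) {
  const pathsRead = [];
  let dir = rootDir;
  while (true) {
    const path = dir + "/node.xml";
    const node = files.get(path);
    // No file here means the parent was a leaf: the key is not indexed.
    if (!node) return { link: null, pathsRead };
    pathsRead.push(path);
    let i = 0;
    while (i < node.keys.length && node.keys[i] < key) i++;
    if (node.keys[i] === key) return { link: node.links[i], pathsRead };
    // The subdirectory name encodes which child reference to follow.
    dir = dir + "/" + i;
  }
}

console.log(lookup(cdFiles, "index", "mango"));
```

In a layout like this, the path itself says which child to follow, so the node files would not need to store child references at all. Each directory also holds only one file and a handful of subdirectories, which keeps any single directory small.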

Figure 1. jsFind creates a B-tree, where an XML file represents each node.

______________________

Comments

mkindex.pl ?

nolsen's picture

Where did you get mkindex.pl and filter.pl from?

The jsFind page (elucidsoft) is not reachable, and Dobrica Pavlinusic's (Perl-)jsFind neither needs nor provides them.

jsfind link

nolsen's picture

just when i thought i'd failed....

telemetric's picture

if you get this error, as i did, after trying to run the binary upon installation:

swish-e: error while loading shared libraries: libswish-e.so.2: cannot open shared object file: No such file or directory

follow these directions:

add /usr/local/lib to /etc/ld.so.conf
# ldconfig
# swish-e -V

ref: http://swish-e.org/archive/2003-11/6385.html

:)

patching

telemetric_au's picture

having a go at this, and seems to be going ok, this is going to be sweet as have been moving into linux from windows for a bit now and this is one of last steps in my process of discarding windows due to not using it :)

anyway, about patching swish with the jsfind Patch

it's a bit tricky as the links to the patches on the jsFind page are broken, but you get one version of the patch file with the main download in the "download" directory...

but it doesn't go on automatically that well, in my case the closest version match i could get was 2.4.0 swish and 2.4.0.2 patch...

if you use the patch command, 2 of the 4 changes will work but you'll have to go in and do the latter 2 by hand in a text editor or something a bit spiffy like "bluefish" :)

just read the patch file and then make the changes it outlines yourself, it's not that hard once you interpret how a patch file is written and "works"

anyway, my latest attempt at swish compilation/installation is done and back at the prompt so i'm out...

Re: Cross-Platform CD Index

Anonymous's picture

Wow, an article posted from the future!

Re: Cross-Platform CD Index

Anonymous's picture

Here's the Linux Journal archive CD which has this software on it.
