How to Index Anything
You might want to build custom indices of documents for many reasons. A widely cited one is to supply search functionality to a Web site, but you also may want to index your e-mail or technical documents. Anyone who has looked into implementing such functionality has probably found it's not as easy as it might seem. Various factors conspire to make searching difficult.
The venerable and indispensable grep and its ilk are effective for scanning through lines of text. But grep, egrep and their relations won't do everything for you. They won't search across lines, they won't show search results in a ranked order and their linear search algorithms don't lend themselves to searching larger volumes of data.
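To make the cross-line limitation concrete, here is a small sketch. The sample text and the temporary file are invented purely for illustration; the point is only that a phrase split by a newline is invisible to a plain, line-oriented grep.

```shell
# Create a two-line sample file in which the phrase "custom index"
# is split across a line break.
tmpfile=$(mktemp)
printf 'how to build a custom\nindex of documents\n' > "$tmpfile"

# grep matches one line at a time, so the phrase is not found.
if grep -q 'custom index' "$tmpfile"; then
    phrase_result="found"
else
    phrase_result="not found"
fi

# Each word on its own does match: count the lines containing either word.
word_count=$(grep -c -e 'custom' -e 'index' "$tmpfile")

echo "phrase: $phrase_result; lines matching either word: $word_count"
rm -f "$tmpfile"
```

Run as written, this prints `phrase: not found; lines matching either word: 2`. An index built from words rather than lines does not have this blind spot.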
HTML doesn't help the situation either. Its display-oriented features, idiosyncratic grammar and bevy of formatting and entity tags make it fairly difficult to parse correctly.
At the other end of the data storage spectrum is data slotted into a database. The ubiquitous example is that of the SQL database, which allows somewhat sophisticated search facilities but usually is not particularly fast for searching. Some database engines, notably MySQL 4, address this issue by allowing fast and ranked searches, but they may not be as customizable as desired.
In this article, we explore ways to create custom indices using SWISH-E, Perl and XML on Linux. Through examples, we show how SWISH-E can be used to build indices of HTML files, PDF files and man pages.
SWISH-E (simple web indexing system for humans—enhanced) is a descendant of SWISH, which was created in 1994 by Kevin Hughes. SWISH was transferred in 1996 to the UC Berkeley Library to fix bugs and add features, and the result was licensed under the GPL and renamed SWISH-E. Development continues, spearheaded by current project maintainer Bill Moseley and assisted by a team of developers.
Here at SkateboardDirectory.com, we happened upon SWISH-E when researching indexing toolkits. We found that it offers a unique combination of features that make it attractive for our purposes. Not only does SWISH-E offer a fast and robust toolkit with which to build and query indices, but it is also well documented, undergoes active development and bug fixes and includes a Perl interface. We also liked that maintainer Moseley and other experienced SWISH-E users and developers are usually prompt when addressing questions and bugs brought up on the SWISH-E mailing list.
For our examples, we started with a stock Red Hat 7.3 workstation with the Software Development bundle of packages installed. We also tested the examples on a Red Hat 6.2 workstation and a Debian Woody system.
Currently, installing SWISH-E on Red Hat means installing from source, and the zlib and libxml2 libraries are required to build SWISH-E fully. If you find you need to install either, you probably can find packages provided with your distribution. We also use the xpdf package in our examples, so you may want to install that now if it isn't already. Our reference Red Hat 7.3 workstation setup had all of SWISH-E's prerequisites installed.
Here, we describe the use of SWISH-E 2.4, which according to the development team, should be released by the time you read this article. You can fetch and set up SWISH-E with the following sequence of commands, substituting the current version for (x.x):
% wget \
    http://swish-e.org/Download/swish-e-x.x.tar.gz
% tar zxf swish-e-x.x.tar.gz
% cd swish-e-x.x
% ./configure
% make
% make test
To install the SWISH-E binary, C libraries and man pages into their default locations in /usr/local, type make install as root. This installs the SWISH-E executable into /usr/local/bin. If this directory isn't in your PATH, either change your appropriate dot file to include /usr/local/bin in your PATH, or always call the swish-e executable by full pathname, like /usr/local/bin/swish-e.
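For Bourne-style shells (sh, bash, ksh), the PATH change might look like the sketch below. Which dot file it belongs in depends on your shell and distribution; ~/.bash_profile is a common choice but is an assumption here, not something SWISH-E requires.

```shell
# Append /usr/local/bin to PATH only if it is not already present.
case ":$PATH:" in
    *:/usr/local/bin:*) ;;                 # already there, nothing to do
    *) PATH="$PATH:/usr/local/bin" ;;
esac
export PATH

# Show which swish-e the shell would now run, if any.
command -v swish-e || echo "swish-e not found in PATH"
```

The `case` test keeps the directory from being added again each time the dot file is re-read.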
Now, let's build and install the SWISH::API Perl module from the Perl directory in the source. We'll need it later when we build a Perl client for our index of man pages. SWISH::API is set up by the normal Perl module install process:
% cd perl
% perl Makefile.PL
% make
% make test
Then, install the SWISH-E Perl module by typing make install as root.
Now that SWISH-E and the SWISH::API Perl module are installed fully, let's build a simple index of HTML files to test SWISH-E. For this example, we index the HTML, one-page-per-section versions of the Linux Documentation Project (LDP) HOWTOs, which we've unpacked into ~/HOWTO-htmls/. The tarballs of LDP documents used in this article come from www.tldp.org/docs.html.
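As a rough sketch of what such a first index can look like, the script below writes a minimal configuration and, if swish-e is installed, builds the index and runs one query. The file names howto-html.conf and howto-html.index and the search word are placeholders chosen for illustration; they are assumptions, not necessarily the names used elsewhere in this article.

```shell
# Hypothetical file names; adjust to taste.
conf=howto-html.conf
index=howto-html.index

# Write a minimal SWISH-E configuration.
# $HOME is expanded by the shell when the file is written.
cat > "$conf" <<EOF
# Directory to index, and where to write the index
IndexDir $HOME/HOWTO-htmls
IndexFile ./$index
# Only look at .html files, and parse them as HTML
IndexOnly .html
IndexContents HTML* .html
EOF

# Build the index and try a sample query, but only if swish-e is available.
if command -v swish-e >/dev/null 2>&1; then
    swish-e -c "$conf"                      # build the index from the config
    swish-e -f "$index" -w 'kernel' -m 5    # show at most five ranked hits
else
    echo "swish-e not installed; wrote $conf only"
fi
```

Here `-c` names the configuration file, `-f` the index to search, `-w` the query words and `-m` the maximum number of results to return.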
Comments
Re: How to Index Anything
In this page you will find the way to do this. Also it has a lot of tips for webmaster and search engines.
Bye
Megucino from
______
Re: How to Index Anything
Does anyone know how to index a dynamic page, such as a PHP/JSP page?
Re: How to Index Anything
For PHP you don't have to do anything special. The server returns HTML to the bot, not PHP or JSP code. The problem for a robot is knowing whether a page such as /pages.php?id=23 is different from /pages.php?id=24. The robot can't index every page with different parameters, so it must implement an algorithm that determines whether pages are similar or identical; if they are, the duplicates shouldn't be indexed.
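One simple way to sketch the "identical page" half of that idea is to checksum each page body and skip any body already seen. The URLs and page bodies below are made up for illustration, and this is not how SWISH-E's spider works internally. It also catches only exact duplicates; pages that are merely similar would need a fuzzier comparison.

```shell
# File holding one checksum per line for every page body seen so far.
seen_file=$(mktemp)

# Reads a page body on stdin.
# Returns 0 if the body is new, 1 if an identical body was seen before.
is_new_page() {
    sum=$(cksum | cut -d' ' -f1,2)    # CRC plus byte count of the body
    if grep -qxF "$sum" "$seen_file"; then
        return 1
    fi
    echo "$sum" >> "$seen_file"
    return 0
}

# Simulated responses for three URLs; the first two bodies are identical.
indexed=""
for entry in 'id=23|<html>Hello</html>' \
             'id=24|<html>Hello</html>' \
             'id=25|<html>Other page</html>'; do
    url="/pages.php?${entry%%|*}"
    body="${entry#*|}"
    if printf '%s' "$body" | is_new_page; then
        indexed="$indexed $url"
    fi
done

echo "would index:$indexed"
rm -f "$seen_file"
```

With the sample data, only /pages.php?id=23 and /pages.php?id=25 are kept, because id=24 returns the same bytes as id=23.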
Re: How to Index Anything
If you spider the site (i.e. -S http) then you don't need to do anything special as long as the PHP/JSP code results in Text, HTML, or XML.
If you use FS method then, at least for PHP, you can have SWISH-E use the PHP cgi executable to process each document into Text, HTML, or XML. In your index configuration add something like this:
IndexContents HTML* .php
FileFilter .php /usr/bin/php "-q '%p'"
Can Swish deal separately with META?
Can Swish deal separately with the META element? It would be very useful to be able to search arbitrary metadata such as authors, keywords or abstracts.
Re: Can Swish deal separately with META?
Yes, SWISH-E will automatically parse META tags in HTML/XML docs,
as per the current SWISH-E 2.4.0 documentation.
SMAN Project RELEASED: search on man pages
Hello All,
The SMAN project is now publicly available from
http://joshr.com/src/sman.
SMAN is an enhanced version of the unix standbys 'man -k' and 'apropos,' as discussed in Josh Rabinowitz's "How To Index Anything" article in the July 2003 issue of Linux Journal.
Please test it out and let Josh know what you think!
From the SMAN README:
Sman is the Searcher for Man pages. Based on the example of the
same name in Josh Rabinowitz's article "How To Index Anything"
in the July, 2003 issue of Linux Journal
(http://www.linuxjournal.com/article.php?sid=6652), sman is
an enhanced version of 'apropos' and 'man -k'. Sman adds
several key abilities over its predecessors:
* Supports complex natural language text searches such as
"(linux and kernel) or (mach and microkernel)"
* Shows results in a ranked order
* Allows for searches by manpage section, title,
body, or filename
* Uses a prebuilt index to perform fast searches
* Performs 'stemming' so that a search for "searches"
will match a document with the word "searching"
Again, SMAN is available from
http://joshr.com/src/sman.
Posted on Tuesday, July 01, 2003?
Posted on Tuesday, July 01, 2003?
Re: Posted on Tuesday, July 01, 2003?
I tried the man page index example and got errors when I entered
swish-e -c sman-index.conf -S prog
I got many warnings like this:
Warning: Unknown header line: ...
Here are the first few and the last couple:
$ swish-e -c sman-index.conf -S prog
Indexing Data Source: "External-Program"
Indexing "./sman-index-prog.pl"
10373 man pages to index...
Warning: Unknown header line: 'll>' from program ./sman-index-prog.pl
:385: warning [p 2, 9.8i]: can't break line
:391: warning [p 2, 10.8i]: can't break line
:399: warning [p 3, 0.8i]: can't break line
Warning: Unknown header line: 'ntains spaces.' from program ./sman-index-prog.pl
Warning: Unknown header line: 'Tcl 8.1 Tcl(n)' from program ./sman-index-prog.pl
Warning: Unknown header line: '' from program ./sman-index-prog.pl
[snip]
Warning: Unknown header line: '>' from program ./sman-index-prog.pl
Warning: Unknown header line: '>' from program ./sman-index-prog.pl
err: External program failed to return required headers Path-Name: & Content-Length:
.
Re: setenv LANG C to work around UTF issues
I was able to get around this by setting the environment variable LANG to "C" like this (adjust for your shell);
setenv LANG C
I think this only needs to be done before indexing with sman-update, and not for sman itself.
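For readers on sh, bash or ksh rather than csh/tcsh, the equivalent would look roughly like the sketch below. The sman-update invocation is shown only in a comment, since it needs sman installed; the last lines just demonstrate that a per-command setting reaches the child process.

```shell
# csh/tcsh form from the comment above:   setenv LANG C
# Bourne-style equivalent (sh, bash, ksh, zsh):
LANG=C
export LANG

# Alternatively, set it for a single command only, leaving the
# rest of the session untouched:
#   LANG=C sman-update
# Demonstrated here with a harmless command that prints the value it sees.
seen=$(LANG=C sh -c 'printf %s "$LANG"')
echo "child process sees LANG=$seen"
```

Note that if LC_ALL is set in your environment it takes precedence over LANG, so it may need to be changed or unset as well.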
Re: Posted on Tuesday, July 01, 2003?
The author says the code was tested on RH6.2, RH7.3, and Debian Woody. Maybe you made a typo, or you have multibyte man pages on your system (which the article and code mention that SWISH-E will gak on?)
I just tried the sman example above and it worked for me on RH6.2:
% swish-e -c sman-index.conf -S prog
Indexing Data Source: "External-Program"
Indexing "./sman-index-prog.pl"
4803 man pages to index...
processing 20
....
There's an enhanced version of SMAN in development at http://joshr.com/src/sman. This version should work better, since it's not shortened to fit in an article.
Re: sman from joshr.com still gives errors
# rpm -q libxml2
libxml2-2.5.4-1
# uname -a
Linux localhost 2.4.20-8 #1 Thu Mar 13 17:54:28 EST 2003 i686 i686 i386 GNU/Linux
# sman-update --verbose --warn --debug
[snip maybe valuable information ?]
**==== END XML of /usr/share/man/mann/Tcl.n.gz =========
** working on /usr/share/man/mann/after.n.gz
** Running man /usr/share/man/mann/after.n.gz...
Warning: Unknown header line: 'd, even if the vari-' from program stdin
Warning: Unknown header line: 'able
Re: sman from joshr.com still gives errors
Some people report that setting the environment variable LANG=en_US might help this issue. I've also heard that a new release of sman is coming that will make it easier to pinpoint the source of errors like this. Let us know if this works!
Re: sman from joshr.com still gives errors
There is a new release at http://joshr.com/src/sman .
Please let us know if this solves your problem.
Sman rocks! Works for me!
I've been using sman for a while on my systems with no problems. It even works on OS X now. There's a new version at http://joshr.com/src/sman
Sman Rocks, and it's on CPAN and Freshmeat
You can now find the latest versions of Sman on Freshmeat at
http://freshmeat.net/projects/sman/
and on CPAN at
http://search.cpan.org/~joshr/Sman/
Re: How to Index Anything
This is cool. But how do Google and Yahoo read through any file type for content? I have done searches for linux and these sites have returned PDF, Word, HTML, Excel, PowerPoint, text and even a Microsoft Project file. How can these sites run such massive searches?
Re: How to Index Anything
Google is a massive array of computers. That's why it is fast.
They have filters for those types. As long as you have filters, you can do it too.
sure, swish-e is not google,
Sure, swish-e is not Google and never will be, but
it can also index MS Word, OpenOffice, PDF and RTF files (apart from the standard XML, HTML and text), and a PPT filter is available by now too (see swish-e.org).
Last but not least, I run swish-e on both Windows and Linux. Almost everything described in this good article is possible with the Windows version of swish-e (yes, you don't have man pages there :)
cheers