How to Index Anything
The first step in building an index with SWISH-E is writing a configuration file. Create a directory like ~/indices, cd into it and create the file ./howto-html.conf with the following contents:
# howto-html.conf
IndexDir ../HOWTO-htmls/
IndexOnly .html
IndexFile ./howto-html.index
The IndexDir directive specifies the directory in which SWISH-E should look for files to be indexed. The IndexOnly directive requests that only files ending in .html be indexed. Finally, the location of the index to be created is specified with the IndexFile directive.
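As a minimal sketch, the setup described above can be scripted. The INDEX_HOME variable is an assumption introduced here for illustration (the article simply uses ~/indices); the three directives are the ones from the listing above. Note that the snippet overwrites any existing howto-html.conf in that directory.

```shell
# Create the working directory and write the configuration file.
# INDEX_HOME is a hypothetical variable; it defaults to ~/indices.
INDEX_HOME="${INDEX_HOME:-$HOME/indices}"
mkdir -p "$INDEX_HOME"

# The quoted 'EOF' keeps the shell from expanding anything in the body.
cat > "$INDEX_HOME/howto-html.conf" <<'EOF'
# howto-html.conf
IndexDir ../HOWTO-htmls/
IndexOnly .html
IndexFile ./howto-html.index
EOF

# Show the result so the directives can be checked before indexing.
cat "$INDEX_HOME/howto-html.conf"
```

Because IndexDir and IndexFile are relative paths, the indexing command should be run from inside that directory.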
Now, let's build our index of HTML files with the command:
% swish-e -c howto-html.conf
The -c option specifies which SWISH-E configuration file to use. On an older system, building this index may take a few minutes; on a contemporary one, it should take under a minute. Figure 1 illustrates the process of indexing HTML files on the filesystem with SWISH-E.

Figure 1. Indexing HTML on the Filesystem with SWISH-E
Let's test our first index by doing a simple search for HTML files relevant to the term NFS. You can test SWISH-E indices quickly using the swish-e executable by specifying an index with the -f option, and the text to be searched with the -w option; searches on SWISH-E indices are case-insensitive. Because we expect a lot of pages (or hits) to include the word NFS, we use the -m 3 option to request only three:
% swish-e -f howto-html.index -m 3 -w nfs
This returns (abridged and reformatted):
1000 ../HOWTO-htmls/NFS-HOWTO/performance.html
"Optimizing NFS Performance" 33288
998 ../HOWTO-htmls/NFS-HOWTO/intro.html
"Introduction" 10966
993 ../HOWTO-htmls/NFS-HOWTO/security.html
"Security and NFS" 35968
Not bad: those pages are definitely about NFS, and the output is intuitive. The first column is the rank SWISH-E gives each hit; the hits considered most relevant always are ranked 1000, with less-relevant files ranked in descending order. The second column shows the name of the file, the third gives the page's title and the fourth shows the byte count of the indexed data. SWISH-E determines the title of each page from the HTML tags in each file using one of its HTML parsing engines.
The built-in SWISH-E parsing engines are called TXT, HTML and XML, and each is designed to parse the corresponding type of content. Recent versions of SWISH-E also can use the libxml2 library for the HTML2 and XML2 parsing back ends. Both the XML2 and HTML2 parsers are preferable to their built-in counterparts—especially HTML2. This is why a recent version of libxml2, though technically optional when building SWISH-E, probably should be considered a prerequisite.
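The four result columns described earlier (rank, file name, title, byte count) lend themselves to scripting. The following is a sketch only: parse_hits is a hypothetical helper name, and it assumes that the unabridged output puts each hit on a single line in the form RANK PATH "TITLE" SIZE, rather than the two-line layout shown in the reformatted listing.

```shell
# parse_hits: read result lines on stdin and print
# rank, size, path and title as tab-separated fields.
# Assumes each hit is a single line of the form:
#   RANK PATH "TITLE" SIZE
# Lines that do not start with a number are skipped.
parse_hits() {
    awk '
        /^[0-9]+ / && match($0, /".*"/) {
            rank  = $1
            size  = $NF
            # Everything between the rank and the first double quote is the path.
            path  = substr($0, length(rank) + 2, RSTART - length(rank) - 3)
            # The title is the quoted span, without its surrounding quotes.
            title = substr($0, RSTART + 1, RLENGTH - 2)
            printf "%s\t%s\t%s\t%s\n", rank, size, path, title
        }
    '
}
```

With a real index, the search output could be piped straight in, for example: swish-e -f howto-html.index -m 3 -w nfs | parse_hits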
SWISH-E supports a full-featured text retrieval search language with syntax including AND, OR, NOT and parenthetic grouping that all work predictably. For example, the following searches all have the expected semantics:
% swish-e -f howto-html.index -w nfs AND tcp
% swish-e -f howto-html.index -w nfs OR tcp
% swish-e -f howto-html.index \
-w '(gandalf OR frodo) OR (lord AND rings)'
SWISH-E configuration files are simple text files in which each line is either a directive or a comment. Any line in which the first non-whitespace character is a # is ignored by SWISH-E as a comment. All other non-empty lines should be in the form:
Directive Options [Options] ...
If you need to specify an option with spaces embedded, you can use quotation marks:
Directive "Option With Spaces!"
If the option has single quotation marks within it, you can quote it with the double quote character and vice versa, for example:
Directive "Fred's Index Option"
Directive 'By Josh "joshr" Rabinowitz'
Dozens of directives can be applied to SWISH-E configuration files. An exhaustive reference can be found in the SWISH-E documentation.
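When a configuration file grows, it can help to see only the lines that count as directives. The sketch below applies the comment rule stated above (a line is ignored when its first non-whitespace character is #) with plain grep; list_directives is a hypothetical helper name, not part of SWISH-E.

```shell
# list_directives FILE: print only the lines that would count as directives,
# i.e. drop blank lines and lines whose first non-whitespace character is '#'.
list_directives() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}
```

Run against the earlier example, list_directives howto-html.conf would print the IndexDir, IndexOnly and IndexFile lines and omit the leading comment.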
Each SWISH-E index is stored in a pair of files. One is named as specified in the IndexFile directive, and the other is called indexname.prop. When talking about a SWISH-E index, we mean this pair of files.
The indices can get large. In our example index of HTML files, the index occupies about 11MB, about one-fourth the size of the original files indexed.
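To check how large an index has become, both files of the pair need to be counted. A minimal sketch, assuming the companion file sits next to the index and is named by appending .prop to the index file name; index_bytes is a hypothetical helper name:

```shell
# index_bytes INDEXFILE: total size in bytes of the index file plus its
# companion INDEXFILE.prop file.
index_bytes() {
    cat "$1" "$1.prop" | wc -c | tr -d ' '
}
```

For the example in this article, the call would be index_bytes howto-html.index.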
Comments
Re: How to Index Anything
In this page you will find the way to do this. Also it has a lot of tips for webmaster and search engines.
Bye
Megucino from
______
Re: How to Index Anything
Does anyone know how to index a dynamic page, such as a PHP/JSP page?
Re: How to Index Anything
For PHP you don't have to do anything special. The server returns HTML to the bot, not PHP or JSP code. The problem for a robot is knowing whether a page such as /pages.php?id=23 is different from /pages.php?id=24. The robot can't index every page with different parameters, so it must implement an algorithm that determines whether two pages are similar or identical, in which case the second one shouldn't be indexed.
Re: How to Index Anything
If you spider the site (i.e. -S http) then you don't need to do anything special as long as the PHP/JSP code results in Text, HTML, or XML.
If you use FS method then, at least for PHP, you can have SWISH-E use the PHP cgi executable to process each document into Text, HTML, or XML. In your index configuration add something like this:
IndexContents HTML* .php
FileFilter .php /usr/bin/php "-q '%p'"
Can Swish deal separately with META?
Can Swish deal separately with the META element? It would be very useful to be able to search arbitrary metadata such as authors, keywords or abstracts.
Re: Can Swish deal separately with META?
Yes, SWISH-E will automatically parse META tags in HTML/XML docs,
as per the current SWISH-E 2.4.0 documentation.
SMAN Project RELEASED: search on man pages
Hello All,
The SMAN project is now publicly available from
http://joshr.com/src/sman.
SMAN is an enhanced version of the unix standbys 'man -k' and 'apropos,' as discussed in Josh Rabinowitz's "How To Index Anything" article in the July 2003 issue of Linux Journal.
Please test it out and let Josh know what you think!
From the SMAN README:
Sman is the Searcher for Man pages. Based on the example of the
same name in Josh Rabinowitz's article "How To Index Anything"
in the July, 2003 issue of Linux Journal
(http://www.linuxjournal.com/article.php?sid=6652), sman is
an enhanced version of 'apropos' and 'man -k'. Sman adds
several key abilities over its predecessors:
* Supports complex natural language text searches such as
"(linux and kernel) or (mach and microkernel)"
* Shows results in a ranked order
* Allows for searches by manpage section, title,
body, or filename
* Uses a prebuilt index to perform fast searches
* Performs 'stemming' so that a search for "searches"
will match a document with the word "searching"
Again, SMAN is available from
http://joshr.com/src/sman.
Posted on Tuesday, July 01, 2003?
Re: Posted on Tuesday, July 01, 2003?
I tried the man page index example and got errors when I entered
swish-e -c sman-index.conf -S prog
I got many warnings like this:
Warning: Unknown header line: ...
Here are the first few and the last couple:
$ swish-e -c sman-index.conf -S prog
Indexing Data Source: "External-Program"
Indexing "./sman-index-prog.pl"
10373 man pages to index...
Warning: Unknown header line: 'll>' from program ./sman-index-prog.pl
:385: warning [p 2, 9.8i]: can't break line
:391: warning [p 2, 10.8i]: can't break line
:399: warning [p 3, 0.8i]: can't break line
Warning: Unknown header line: 'ntains spaces.' from program ./sman-index-prog.pl
Warning: Unknown header line: 'Tcl 8.1 Tcl(n)' from program ./sman-index-prog.pl
Warning: Unknown header line: '' from program ./sman-index-prog.pl
[snip]
Warning: Unknown header line: '>' from program ./sman-index-prog.pl
Warning: Unknown header line: '>' from program ./sman-index-prog.pl
err: External program failed to return required headers Path-Name: & Content-Length:
.
Re: setenv LANG C to work around UTF issues
I was able to get around this by setting the environment variable LANG to "C" like this (adjust for your shell):
setenv LANG C
I think this only needs to be done before indexing with sman-update, and not for sman itself.
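The setenv form above is csh syntax. For Bourne-style shells, a sketch of the equivalent follows; it has not been checked against sman itself.

```shell
# Bourne-style shells (sh, bash, zsh): same effect as csh "setenv LANG C".
LANG=C
export LANG
```

To limit the change to a single run, the assignment can prefix the command instead: LANG=C sman-update. If LC_ALL is set in the environment, it takes precedence over LANG.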
Re: Posted on Tuesday, July 01, 2003?
The author says the code was tested on RH6.2, RH7.3, and Debian Woody. Maybe you made a typo, or you have multibyte man pages on your system (which the article and code mention that SWISH-E will gak on?)
I just tried the sman example above and it worked for me on RH6.2:
% swish-e -c sman-index.conf -S prog
Indexing Data Source: "External-Program"
Indexing "./sman-index-prog.pl"
4803 man pages to index...
processing 20
....
There's an enhanced version of SMAN in development at http://joshr.com/src/sman. This version should work better, since it's not shortened to fit in an article.
Re: sman from joshr.com still gives errors
# rpm -q libxml2
libxml2-2.5.4-1
# uname -a
Linux localhost 2.4.20-8 #1 Thu Mar 13 17:54:28 EST 2003 i686 i686 i386 GNU/Linux
# sman-update --verbose --warn --debug
[snip maybe valuable information ?]
**==== END XML of /usr/share/man/mann/Tcl.n.gz =========
** working on /usr/share/man/mann/after.n.gz
** Running man /usr/share/man/mann/after.n.gz...
Warning: Unknown header line: 'd, even if the vari-' from program stdin
Warning: Unknown header line: 'able
Re: sman from joshr.com still gives errors
Some people report that setting the environment variable LANG=en_US might help this issue. I've also heard that a new release of sman is coming that will make it easier to pinpoint the source of errors like this. Let us know if this works!
Re: sman from joshr.com still gives errors
There is a new release at http://joshr.com/src/sman .
Please let us know if this solves your problem.
Sman rocks! Works for me!
I've been using sman for a while on my systems with no problems. It even works on OS X now. There's a new version at http://joshr.com/src/sman
Sman Rocks, and it's on CPAN and Freshmeat
You can now find the latest versions of Sman on Freshmeat at
http://freshmeat.net/projects/sman/
and on CPAN at
http://search.cpan.org/~joshr/Sman/
Re: How to Index Anything
This is cool. But how do Google or Yahoo read through any file type for content? I have done searches for Linux, and these sites have returned PDF, Word, HTML, Excel, PowerPoint, text, and even a Microsoft Project file. How can these sites run such massive searches?
Re: How to Index Anything
Google is a massive array of computers. That's why it is fast.
They have filters for those types. As long as you have filters, you can do it too.
sure, swish-e is not google,
sure, swish-e is not google, and never will be, but
it can also index MS Word, OpenOffice, PDF and RTF (apart from standard XML, HTML and TXT); a PPT filter is also available by now (see swish-e.org).
And last but not least: I run swish-e on Windows and Linux too. Almost everything described in this good article is possible with the Windows version of swish-e (yes, you don't have man pages there :)
cheers