Filesystem Indexing with libferris
Jumping right in, I create a full-text index in /tmp, add some files to it and then query the index for files. First, I create a directory for the new index and use gfcreate to set up the index in that directory:
$ mkdir /tmp/text-index
$ gfcreate /tmp/text-index
The GUI for gfcreate shows the major MIME types in the leftmost tab, with a misc tab for things that can be created and that are considered distinct from MIME. After selecting misc, all the available index formats are shown in a second level of tabs. In Figure 1, I've chosen to create a Xapian full-text index using English for word stemming and ignoring word case.
When adding files to the full-text index, libferris attempts to use the as-text EA to obtain a textual representation of the file. Many plugins have been created supporting the as-text EA; PDF files, HTML files, man pages and djvu images all support as-text.
The findexadd and findexquery tools can be told which index to use with the -P command-line option. The following uses a PDF file and man page from the Samba 3.0.3 package as example input. As paths will vary depending on your Linux distribution, prefixes to the files have been replaced with /.../:
$ findexadd -P /tmp/text-index \
    /.../samba-3.0.3/docs/Samba-HOWTO-Collection.pdf
$ findexquery -P /tmp/text-index samba
ID 1 99% [file://.../Samba-HOWTO-Collection.pdf]
Found 1 matches at the following locations:
file://.../Samba-HOWTO-Collection.pdf
$ findexadd -P /tmp/text-index /.../samba.7.gz
$ findexquery -P /tmp/text-index smbstatus
ID 1 100% [file://.../Samba-HOWTO-Collection.pdf]
ID 5 93% [file://.../samba.7.gz]
Found 2 matches at the following locations:
file://.../Samba-HOWTO-Collection.pdf
file://.../samba.7.gz
The most interesting options for findexquery are -P for setting the path to the index, --ranked for performing ranked full-text queries and --xapian for passing raw Xapian format queries to the back end (see Resources).
The default query format is Boolean. In this format, all alphanumeric words are looked up in the index, and four Boolean operators can be used infix: & (and), | (or), ! (not) and - (minus). Ranked mode combines all terms and returns a list of the documents that are most interesting based on your query. In Xapian format, libferris hands the query directly to the back end for processing; currently, only the Xapian back end can handle such queries.
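To make the Boolean format more concrete, here is a minimal Python sketch of how and, or and minus can map onto set operations over an inverted index. It is an illustration only, not how libferris or Xapian implement queries; the documents, the build_index and lookup helpers and the word-splitting rule are all assumptions made for the example, and the ! operator is left out.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercased alphanumeric word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[A-Za-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

def lookup(index, word):
    """Return a copy of the posting set for a word (empty if the word is unknown)."""
    return set(index.get(word.lower(), set()))

# Hypothetical documents standing in for indexed files.
docs = {
    1: "Samba HOWTO collection: smbd, nmbd and smbstatus",
    2: "samba manual page mentioning smbstatus",
    3: "Notes on NFS exports",
}
index = build_index(docs)

both = lookup(index, "samba") & lookup(index, "smbstatus")  # samba & smbstatus
either = lookup(index, "samba") | lookup(index, "nfs")      # samba | nfs
minus = lookup(index, "samba") - lookup(index, "howto")     # samba - howto
```

Each operator reduces to one set operation on the posting sets, which is why Boolean queries return a plain match list with no ranking.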
The procedure for adding to and querying EA indexes closely follows that of full-text indexing. EA indexes use the feaindexadd and feaindexquery commands, which both accept the -P /path-to-index option.
There are three parameters for overall tuning of EA indexes. These can be set when the EA index is created. They relate to the EAs you are interested in indexing for your files. For example, you can create a lean EA index containing only filenames, sizes and some image properties for use in image file searching. You also may choose to ignore some EAs that take a while to calculate or that are not relevant to your search. For example, if you are not planning to use the index for integrity checking, ignoring the MD5, SHA-1 and other checksums saves considerable time, because these checksums require the entire file to be read for each file being added to the index.
The first of the general EA index parameters is max-values-size-to-index, which defines the largest byte length a value may have and still be added to the index, for any attribute. Most EAs should have fairly short values of less than 100 bytes. The default is lenient and allows up to 1,024 bytes for any individual EA value. The other two parameters are attributes-not-to-index and attributes-not-to-index-regex. These define the names of EAs to be ignored when files are being added to the index. There is a direct trade-off here: indexing all EAs makes adding files slower but preserves all information for queries, whereas indexing only a subset of EAs makes adding files faster but means that some queries cannot be executed.
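The effect of these three parameters can be sketched in a few lines of Python. This is a hedged illustration of the filtering idea, not libferris code: the select_eas function, the EA names in the sample dictionary and the choice to match the regex anywhere in the EA name are all assumptions.

```python
import re

def select_eas(eas, max_value_size=1024, not_to_index=(), not_to_index_regex=None):
    """Return only the EAs that would be indexed under the given limits.

    eas                 -- dict of EA name -> string value (hypothetical input)
    max_value_size      -- largest value length in bytes that is still indexed
    not_to_index        -- EA names to skip outright
    not_to_index_regex  -- pattern; EA names it matches are skipped (assumed semantics)
    """
    skip_names = set(not_to_index)
    skip_pattern = re.compile(not_to_index_regex) if not_to_index_regex else None
    kept = {}
    for name, value in eas.items():
        if name in skip_names:
            continue
        if skip_pattern and skip_pattern.search(name):
            continue
        if len(value.encode("utf-8")) > max_value_size:
            continue
        kept[name] = value
    return kept

# Hypothetical EAs for one image file.
file_eas = {
    "name": "holiday.jpg",
    "size": "482113",
    "width": "1600",
    "md5": "0" * 32,
    "sha1": "0" * 40,
    "as-text": "x" * 5000,  # too long for the default 1,024-byte limit
}
kept = select_eas(file_eas, not_to_index=["md5"], not_to_index_regex=r"^sha")
```

In this sketch the checksums are skipped by name, so they would never need to be calculated, and the oversized value is dropped by the length limit. Any later query on a skipped EA would find nothing, which is the trade-off described above.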
The defaults for the not-to-index parameters should allow files to be added fairly quickly but still allow many interesting EAs to be indexed. The defaults for all three parameters can be overridden by running the cc/capplets/index/ferris-capplet-index tool, which sets the defaults for new index creation and alters your ~/.ferris indexes for future file indexing.
For the EA index, I use a PostgreSQL database to store the index. For seamless EA index creation, a minor setup step is required before running the fcreate tools. The PL/pgSQL language (plpgsql) must be enabled by default for new databases. The command below does this when run as root:
# createlang -d template1 plpgsql
If you don't want to change the template1 database, you can create the PostgreSQL database manually, enable plpgsql for the new database and append db-exists=1 to the fcreate command line below.
For PostgreSQL EA indexes, you also can set the user name, password, host, port and dbname used to connect to the PostgreSQL database. By default (db-exists=0), the database named dbname must not exist already; it is created for the new EA index.
There is another tweakable parameter for EA indexes that use a relational database as their implementation, allowing you to change how some EAs are normalized in the relational database. Once again, the default values should be acceptable. I explain this trade-off in a moment.
The extra-columns-to-inline-in-docmap parameter gives a list of EAs that are so important to searching that they should be denormalized into the docmap table. EAs that have a unique value for almost every file are stored more efficiently inline in the docmap table. To denormalize an EA in this way, you also must provide the SQL type for that EA.
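The difference between a normalized EA and one inlined into docmap can be shown with a small sketch. It uses Python's built-in sqlite3 module in place of PostgreSQL so that it runs anywhere, and every table other than docmap, every column name and every URL below is hypothetical; the real libferris schema may look quite different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per indexed file; 'size' is inlined with an explicit SQL type.
    CREATE TABLE docmap (docid INTEGER PRIMARY KEY, url TEXT, size INTEGER);
    -- Normalized storage: attribute names and values live in side tables.
    CREATE TABLE attrmap (attrid INTEGER PRIMARY KEY, attrname TEXT UNIQUE);
    CREATE TABLE docattrs (docid INTEGER, attrid INTEGER, value TEXT);
""")
conn.executemany("INSERT INTO docmap VALUES (?, ?, ?)", [
    (1, "file:///tmp/a.pdf", 2048),
    (2, "file:///tmp/b.gz", 512),
])
conn.execute("INSERT INTO attrmap VALUES (1, 'mimetype')")
conn.executemany("INSERT INTO docattrs VALUES (?, ?, ?)", [
    (1, 1, "application/pdf"),
    (2, 1, "application/gzip"),
])

# Inlined EA: a single-table lookup with a typed comparison.
inline_hits = [row[0] for row in conn.execute(
    "SELECT url FROM docmap WHERE size > 1000 ORDER BY docid")]

# Normalized EA: the same kind of question needs two joins.
normalized_hits = [row[0] for row in conn.execute("""
    SELECT d.url FROM docmap d
    JOIN docattrs a ON a.docid = d.docid
    JOIN attrmap m ON m.attrid = a.attrid
    WHERE m.attrname = 'mimetype' AND a.value = 'application/pdf'
    ORDER BY d.docid""")]
```

A value such as a file size is different for nearly every file, so sharing it through side tables saves little, while a value with few distinct settings, such as a MIME type, benefits from being stored once. The inlined column needs a declared SQL type, which is why the parameter asks for one.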
Comments
So So
The article is timely. Unfortunately, unless you are running a Fedora Core 3 system with an Athlon CPU, you can't use the binaries, and the source RPMs don't compile. So if I were ever to get it all together, I might be able to do something.
packages
Please email the build failure logs for the src.rpm files to the libferris mailing list. It's always hard to fix what I can't see.
Also note that there is some initial support in Gentoo for installing libferris. If anyone wants to maintain packages for Debian, SUSE and so on, I'll be very happy to hear from you.