Reading File Metadata with extract and libextractor

Don't just guess about a file's characteristics in a search. Use specific extractor plugins to build an accurate database of files.

Modern file formats have provisions to annotate the contents of the file with descriptive information. This development is driven by the need to find a better way to organize data than merely by using filenames. The problem with such metadata is that it is not stored in a standardized manner across different file formats. This makes it difficult for format-agnostic tools, such as file managers or file-sharing applications, to make use of the information. It also results in a plethora of format-specific tools for extracting the metadata, such as AVInfo, id3edit, jpeginfo and Vocoditor.

This article introduces the libextractor library and the extract tool. The goal of the libextractor Project is to provide a uniform interface for obtaining metadata from different file formats. libextractor currently is used by evidence, the file manager for the forthcoming version of Enlightenment, as well as by GNUnet, an anonymous, censorship-resistant peer-to-peer file-sharing system. The extract tool is a command-line interface to the library. libextractor is licensed under the GNU General Public License.

libextractor shares some similarities with the popular file tool, which uses the first bytes in a file to guess the MIME type. libextractor differs from file in that it tries to obtain much more information than the MIME type. Depending on the file format, libextractor can obtain additional information, including the name of the software used to create the file, the author, descriptions, album titles, image dimensions or the duration of a movie.
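To make the comparison with file concrete, here is a minimal C sketch of detection by leading bytes. It is not taken from the source of file or of libextractor. The names guess_mime and magic_table are made up for this illustration, and the table holds only a handful of well-known signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One signature: the leading bytes that identify a format. */
struct magic_entry {
  const char *magic;  /* bytes expected at offset 0 */
  size_t length;      /* number of bytes in magic */
  const char *mime;   /* MIME type reported on a match */
};

/* A deliberately tiny, illustrative table; real tools know far more. */
static const struct magic_entry magic_table[] = {
  { "\xFF\xD8\xFF", 3, "image/jpeg" },
  { "\x89PNG\r\n\x1A\n", 8, "image/png" },
  { "GIF87a", 6, "image/gif" },
  { "GIF89a", 6, "image/gif" },
  { "%PDF-", 5, "application/pdf" },
  { "PK\x03\x04", 4, "application/zip" },
};

/* Return the MIME type of the first matching signature, or NULL. */
const char *guess_mime(const unsigned char *data, size_t size) {
  size_t i;

  assert(data != NULL || size == 0);
  for (i = 0; i < sizeof(magic_table) / sizeof(magic_table[0]); i++) {
    const struct magic_entry *e = &magic_table[i];

    /* Skip signatures longer than the buffer, then compare bytes. */
    if (size >= e->length && memcmp(data, e->magic, e->length) == 0)
      return e->mime;
  }
  return NULL;
}
```

Calling guess_mime on a buffer that begins with the bytes %PDF- yields "application/pdf", and a buffer that matches no entry yields NULL. A format-specific parser starts where this sketch stops: once the format is known, it reads the rest of the file for titles, authors, dimensions and so on.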

libextractor obtains this information by using specific parser code for many popular formats. The list currently includes MP3, Ogg, Real Media, MPEG, RIFF (AVI), GIF, JPEG, PNG, TIFF, HTML, PDF, PostScript, Zip, OpenOffice.org, StarOffice, Microsoft Office, tar, DVI, man, Deb, ELF, RPM and ASF, as well as generic methods such as MIME-type detection. Many other formats exist, but among the more popular formats only a few proprietary ones are not supported.

Integrating support for new formats is easy, because libextractor uses plugins to gather data. libextractor plugins are shared libraries that typically provide code to parse one particular format. At the end of this article, we demonstrate how to integrate support for new formats into the library. libextractor gathers the metadata obtained from various plugins and provides clients with a list of pairs, consisting of a classification and a character sequence. The classification is used to organize the metadata into categories such as title, creator, subject and description.
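The list of pairs described above can be pictured as a singly linked list. The following C sketch is an assumption made for illustration, not the library's real declarations: the names kw_type, kw_node, kw_add, kw_find and kw_free are hypothetical, and the categories are only the few mentioned in the text. It copies each string with malloc and memcpy so that the list owns its own storage and can free it later.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical categories; the real library defines its own set. */
enum kw_type { KW_MIMETYPE, KW_TITLE, KW_CREATOR, KW_SUBJECT, KW_DESCRIPTION };

/* One (classification, character sequence) pair. */
struct kw_node {
  enum kw_type type;
  char *keyword;         /* heap-allocated copy, owned by the list */
  struct kw_node *next;
};

/* Prepend a pair to the list. The text is copied, so the caller keeps
   ownership of its argument. Returns the new head, or the old head
   unchanged if memory runs out. */
struct kw_node *kw_add(struct kw_node *head, enum kw_type type,
                       const char *text) {
  struct kw_node *node;
  size_t len;

  assert(text != NULL);
  node = malloc(sizeof(*node));
  if (node == NULL)
    return head;
  len = strlen(text) + 1;
  node->keyword = malloc(len);
  if (node->keyword == NULL) {
    free(node);
    return head;
  }
  memcpy(node->keyword, text, len);
  node->type = type;
  node->next = head;
  return node;
}

/* Return the first string stored under the given category, or NULL. */
const char *kw_find(const struct kw_node *head, enum kw_type type) {
  for (; head != NULL; head = head->next)
    if (head->type == type)
      return head->keyword;
  return NULL;
}

/* Release every node together with the string it owns. */
void kw_free(struct kw_node *head) {
  while (head != NULL) {
    struct kw_node *next = head->next;

    free(head->keyword);
    free(head);
    head = next;
  }
}
```

A client would build the list with repeated calls such as kw_add(list, KW_TITLE, "Some title"), look up entries with kw_find and release everything with a single kw_free. Because the list frees its strings, every stored string must be a heap copy; handing it a string literal directly would make the final free invalid.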

Installing libextractor and Using extract

The simplest way to install libextractor is to use one of the binary packages available for many distributions. Under Debian, the extract tool is in a separate package, extract. Headers required to compile other applications against libextractor are contained in libextractor0-devel. If you want to compile libextractor from source, you need an unusual amount of memory: 256MB of system memory is roughly the minimum, as GCC uses about 200MB to compile one of the plugins. Otherwise, compiling by hand follows the usual sequence of steps, as shown in Listing 1.

After installing libextractor, the extract tool can be used to obtain metadata from documents. By default, the extract tool uses a canonical set of plugins, which consists of all file-format-specific plugins supported by the current version of libextractor, together with the MIME-type detection plugin. Example output for the Linux Journal Web site is shown in Listing 2.

If you are a user of BibTeX, the option -b is likely to come in handy to create BibTeX entries automatically from documents that have been equipped properly with metadata, as shown in Listing 3.

Another interesting option is -B LANG. This option loads one of the language-specific but format-agnostic plugins. These plugins attempt to find plain text in a document by matching strings in the document against a dictionary. If the need for 200MB of memory to compile libextractor seems mysterious, the answer lies in these plugins. To make the dictionary search fast, a Bloom filter is created that allows fast probabilistic matching; GCC finds the resulting data structure a bit hard to swallow.
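For readers unfamiliar with Bloom filters, the following C sketch shows the general technique. It is not the code used by the libextractor plugins: the filter size, the number of probes, the two hash functions (FNV-1a and djb2, combined by double hashing) and all names beginning with bloom_ are choices made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS 8192u  /* filter size in bits; arbitrary for this sketch */
#define BLOOM_HASHES 3u   /* number of bits probed per word */

struct bloom {
  unsigned char bits[BLOOM_BITS / 8];
};

/* 32-bit FNV-1a hash. */
static uint32_t bloom_fnv1a(const char *word) {
  uint32_t h = 2166136261u;

  for (; *word != '\0'; word++) {
    h ^= (unsigned char)*word;
    h *= 16777619u;
  }
  return h;
}

/* djb2 hash, used as the second hash for double hashing. */
static uint32_t bloom_djb2(const char *word) {
  uint32_t h = 5381u;

  for (; *word != '\0'; word++)
    h = h * 33u + (unsigned char)*word;
  return h;
}

/* Bit position of the i-th probe for a word: h1 + i * h2, mod size. */
static uint32_t bloom_index(const char *word, uint32_t i) {
  uint32_t h1 = bloom_fnv1a(word);
  uint32_t h2 = bloom_djb2(word) | 1u;  /* force odd so probes differ */

  return (h1 + i * h2) % BLOOM_BITS;
}

/* Clear every bit: the filter then contains no words. */
void bloom_init(struct bloom *f) {
  assert(f != NULL);
  memset(f->bits, 0, sizeof(f->bits));
}

/* Record a word by setting all of its probe bits. */
void bloom_add(struct bloom *f, const char *word) {
  uint32_t i;

  assert(f != NULL && word != NULL);
  for (i = 0; i < BLOOM_HASHES; i++) {
    uint32_t idx = bloom_index(word, i);

    f->bits[idx / 8] |= (unsigned char)(1u << (idx % 8));
  }
}

/* Return 0 if the word was definitely never added, 1 if it may have been. */
int bloom_maybe_contains(const struct bloom *f, const char *word) {
  uint32_t i;

  assert(f != NULL && word != NULL);
  for (i = 0; i < BLOOM_HASHES; i++) {
    uint32_t idx = bloom_index(word, i);

    if ((f->bits[idx / 8] & (1u << (idx % 8))) == 0)
      return 0;
  }
  return 1;
}
```

A word that was added is always reported as possibly present, whereas a word that was never added is usually, but not always, rejected. That trade-off is what makes the matching probabilistic. A dictionary compiled into a plugin as a precomputed bit array of this kind would be a very large static initializer, which is consistent with the memory use described above.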

The option -B is useful for formats that currently are undocumented or unsupported. The printable plugins typically print the entire text of the document in order. Listing 4 shows the output of extract run on a Microsoft Word document.

______________________

Comments


Test this tool online

Anonymous

Online metadata reader is using libextractor. Might be handy if someone wants to test the results first without installing libextractor.

Can this extract index

Anonymous

Can this extract index information of PDF files?

Extracting titles from word documents on linux

Anonymous

WORD DOCS ARE SUPPORTED

A quick scan of the examples on this page initially made it seem to me as if the extract program does not support Microsoft Word documents.

Closer inspection reveals that extracting metadata from Office documents is supported.


[foo@localhost ~]$ extract foo.doc
mimetype - application/vnd.ms-files
os - Win32
organization - Foo Publishing
page count - 1
modification date - Tue Sep 6 16:10:00 2005
software - Microsoft Office Word
version - 3
format - ABC123
keywords - SCADA, Cryptographic Protection, Communications
author - ABC123 Task Group
subject - Cryptographic Protection of SCADA Communications
title - ABC123 Draft 3
[foo@localhost ~]$

Missing strdup()?

Aron Stansvik

"The strdup in the code is important, because the string will be deallocated later, typically in EXTRACTOR_freeKeywords()."

If that strdup() is so important, then where is it? ;)

strdup necessary

Mike W

The strdup() referred to is in Listing 8 !!
R-E-A-D M-O-R-E C-A-R-E-F-U-L-L-Y !

Right there?

Christian_Grothoff

The strdup can either be in addKeyword or, as in the article, before the call to addKeyword:

addKeyword(&prev,
           strdup("image/jpeg"),
           EXTRACTOR_MIMETYPE);

I'm also not aware of any strdup's missing (at the moment) in the actual source, so I'm not sure what your comment refers to. :-)
