Reading File Metadata with extract and libextractor
Modern file formats have provisions to annotate the contents of the file with descriptive information. This development is driven by the need to find a better way to organize data than merely by using filenames. The problem with such metadata is that it is not stored in a standardized manner across different file formats. This makes it difficult for format-agnostic tools, such as file managers or file-sharing applications, to make use of the information. It also results in a plethora of format-specific tools used to extract the metadata, such as AVInfo, id3edit, jpeginfo and Vocoditor.
In this article, the libextractor library and the extract tool are introduced. The goal of the libextractor Project is to provide a uniform interface for obtaining metadata from different file formats. libextractor currently is used by evidence, the file manager for the forthcoming version of Enlightenment, as well as by GNUnet, an anonymous, censorship-resistant peer-to-peer file-sharing system. The extract tool is a command-line interface to the library. libextractor is licensed under the GNU General Public License.
libextractor shares some similarities with the popular file tool, which uses the first bytes in a file to guess the MIME type. libextractor differs from file in that it tries to obtain much more information than the MIME type. Depending on the file format, libextractor can obtain additional information, including the name of the software used to create the file, the author, descriptions, album titles, image dimensions or the duration of a movie.
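The idea of guessing a MIME type from the first bytes of a file can be sketched in a few lines of C. This is only an illustration of the technique, not the code used by file or by libextractor; the Signature type and the guess_mime function are hypothetical names, and the table covers just four well-known signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry per format: the leading bytes and the MIME type they imply.
   Hypothetical type, for illustration only. */
typedef struct {
  const char *magic;
  size_t len;
  const char *mime;
} Signature;

static const Signature SIGNATURES[] = {
  { "\xFF\xD8\xFF",      3, "image/jpeg" },
  { "\x89PNG\r\n\x1A\n", 8, "image/png" },
  { "GIF8",              4, "image/gif" },
  { "%PDF-",             5, "application/pdf" },
};

/* Return the MIME type whose signature matches the start of the buffer,
   or NULL if no known signature matches. */
const char *guess_mime(const unsigned char *data, size_t size) {
  size_t count = sizeof SIGNATURES / sizeof SIGNATURES[0];
  assert(data != NULL || size == 0);
  for (size_t i = 0; i < count; i++) {
    const Signature *s = &SIGNATURES[i];
    if (size >= s->len && memcmp(data, s->magic, s->len) == 0)
      return s->mime;
  }
  return NULL;
}
```

A caller would read the first few bytes of a file into a buffer and pass them to guess_mime. Everything beyond the MIME type, such as author or image dimensions, requires parsing the format itself, which is where libextractor goes further.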
libextractor obtains this information by using specific parser code for many popular formats. The list currently includes MP3, Ogg, Real Media, MPEG, RIFF (AVI), GIF, JPEG, PNG, TIFF, HTML, PDF, PostScript, Zip, OpenOffice.org, StarOffice, Microsoft Office, tar, DVI, man, Deb, ELF, RPM and ASF, as well as generic methods such as MIME-type detection. Many other formats exist, and among the more popular formats only a few proprietary formats are not supported.
Integrating support for new formats is easy, because libextractor uses plugins to gather data. libextractor plugins are shared libraries that typically provide code to parse one particular format. At the end of this article, we demonstrate how to integrate support for new formats into the library. libextractor gathers the metadata obtained from various plugins and provides clients with a list of pairs, consisting of a classification and a character sequence. The classification is used to organize the metadata into categories such as title, creator, subject and description.
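The list of pairs described above can be pictured as a singly linked list. The following sketch is a stand-in written for this explanation: MetaType, MetaItem, meta_add, meta_find and meta_free are hypothetical names and do not reproduce libextractor's actual declarations. It assumes the list owns a private copy of each string, which is why the value is duplicated with strdup before it is stored.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical categories; the real library defines its own set. */
typedef enum { META_MIMETYPE, META_TITLE, META_CREATOR, META_DESCRIPTION } MetaType;

/* One (classification, character sequence) pair in the list. */
typedef struct MetaItem {
  MetaType type;
  char *value;            /* heap copy owned by the list */
  struct MetaItem *next;
} MetaItem;

/* Prepend a pair and return the new head. On allocation failure the
   list is returned unchanged. */
MetaItem *meta_add(MetaItem *head, MetaType type, const char *value) {
  MetaItem *item;
  assert(value != NULL);
  item = malloc(sizeof *item);
  if (item == NULL)
    return head;
  item->value = strdup(value);
  if (item->value == NULL) {
    free(item);
    return head;
  }
  item->type = type;
  item->next = head;
  return item;
}

/* Return the first value stored under the given type, or NULL. */
const char *meta_find(const MetaItem *head, MetaType type) {
  for (; head != NULL; head = head->next)
    if (head->type == type)
      return head->value;
  return NULL;
}

/* Release every node together with the string it owns. */
void meta_free(MetaItem *head) {
  while (head != NULL) {
    MetaItem *next = head->next;
    free(head->value);
    free(head);
    head = next;
  }
}
```

Because meta_free releases each stored string, handing the list a string literal directly instead of a heap copy would lead to an invalid free. That is the reason a duplication step matters in a design like this one.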
The simplest way to install libextractor is to use one of the binary packages available for many distributions. Under Debian, the extract tool is in a separate package, extract. Headers required to compile other applications against libextractor are contained in libextractor0-devel. If you want to compile libextractor from source, you need an unusual amount of memory: 256MB of system memory is roughly the minimum, as GCC uses about 200MB to compile one of the plugins. Otherwise, compiling by hand follows the usual sequence of steps, as shown in Listing 1.
Listing 1. Compiling libextractor requires about 200MB of memory.
$ wget http://ovmj.org/libextractor/download/libextractor-0.4.1.tar.gz
$ tar xvfz libextractor-0.4.1.tar.gz
$ cd libextractor-0.4.1
$ ./configure --prefix=/usr/local
$ make
# make install
After installing libextractor, the extract tool can be used to obtain metadata from documents. By default, the extract tool uses a canonical set of plugins, which consists of all file-format-specific plugins supported by the current version of libextractor, together with the mime-type detection plugin. Example output for the Linux Journal Web site is shown in Listing 2.
Listing 2. Extracting metadata from HTML.
$ wget -q http://www.linuxjournal.com/
$ extract index.html
description - The Monthly Magazine of the Linux Community
keywords - linux, linux journal, magazine
If you are a user of BibTeX, the option -b is likely to come in handy to create BibTeX entries automatically from documents that have been equipped properly with metadata, as shown in Listing 3.
Listing 3. Creating BibTeX entries can be trivial if the documents come with plenty of metadata.
$ wget -q http://www.copyright.gov/legislation/dmca.pdf
$ extract -b ~/dmca.pdf
% BiBTeX file
@misc{ unite2001the_d,
title = "The Digital Millennium Copyright Act
of 1998",
author = "United States Copyright Office - jmf",
note = "digital millennium copyright act
circumvention technological protection management
information online service provider liability
limitation computer maintenance competition
repair ephemeral recording webcasting distance
education study vessel hull",
year = "2001",
month = "10",
key = "Copyright Office Summary of the DMCA",
pages = "18"
}
Another interesting option is -B LANG. This option loads one of the language-specific but format-agnostic plugins. These plugins attempt to find plain text in a document by matching strings in the document against a dictionary. If the need for 200MB of memory to compile libextractor seems mysterious, the answer lies in these plugins. In order to perform a fast dictionary search, a Bloom filter is created that allows fast probabilistic matching; GCC finds the resulting data structure a bit hard to swallow.
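To illustrate what probabilistic matching means here, the following is a minimal Bloom filter sketch in C. It is not the code from the libextractor plugins: the Bloom type, the bloom_* functions, the filter size and the FNV-1a-style hash are all assumptions chosen for brevity. A Bloom filter never misses a word that was added, but it may occasionally report a word that was never added.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS   8192u  /* size of the bit array (assumed) */
#define BLOOM_HASHES 3u     /* bits set per word (assumed) */

typedef struct {
  uint8_t bits[BLOOM_BITS / 8];
} Bloom;

/* FNV-1a-style hash; the seed varies the result per hash function. */
static uint32_t bloom_hash(const char *word, uint32_t seed) {
  uint32_t h = 2166136261u ^ seed;
  for (; *word != '\0'; word++) {
    h ^= (uint8_t)*word;
    h *= 16777619u;
  }
  return h % BLOOM_BITS;
}

/* Clear all bits so that the filter represents the empty set. */
void bloom_init(Bloom *b) {
  memset(b->bits, 0, sizeof b->bits);
}

/* Set one bit per hash function for the given word. */
void bloom_add(Bloom *b, const char *word) {
  assert(word != NULL);
  for (uint32_t i = 0; i < BLOOM_HASHES; i++) {
    uint32_t bit = bloom_hash(word, i * 0x9E3779B9u);
    b->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
  }
}

/* false: the word was definitely never added.
   true: the word was probably added (false positives are possible). */
bool bloom_maybe_contains(const Bloom *b, const char *word) {
  assert(word != NULL);
  for (uint32_t i = 0; i < BLOOM_HASHES; i++) {
    uint32_t bit = bloom_hash(word, i * 0x9E3779B9u);
    if ((b->bits[bit / 8] & (1u << (bit % 8))) == 0)
      return false;
  }
  return true;
}
```

A lookup touches only a handful of bits regardless of dictionary size, which makes the search fast. A filter that is precomputed for a whole dictionary and embedded in source code as one large constant array would plausibly explain why the compiler needs so much memory.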
The option -B is useful for formats that currently are undocumented or unsupported. The printable plugins typically print the entire text of the document in order. Listing 4 shows the output of extract run on a Microsoft Word document.
Comments
Test this tool online
Online metadata reader is using libextractor. Might be handy if someone wants to test the results first without installing libextractor.
Can this extract index
Can this extract index information of PDF files?
Extracting titles from word documents on linux
WORD DOCS ARE SUPPORTED
A quick scan of the examples on this page initially made it seem to me as if the extract program does not support Microsoft Word documents.
Closer inspection reveals that extracting metadata from Office documents is supported.
[foo@localhost ~]$ extract foo.doc
mimetype - application/vnd.ms-files
os - Win32
organization - Foo Publishing
page count - 1
modification date - Tue Sep 6 16:10:00 2005
software - Microsoft Office Word
version - 3
format - ABC123
keywords - SCADA, Cryptographic Protection, Communications
author - ABC123 Task Group
subject - Cryptographic Protection of SCADA Communications
title - ABC123 Draft 3
[foo@localhost ~]$
Missing strdup()?
"The strdup in the code is important, because the string will be deallocated later, typically in EXTRACTOR_freeKeywords()."
If that strdup() is so important, then where is it? ;)
strdup necessary
The strdup() referred to is in Listing 8 !!
R-E-A-D M-O-R-E C-A-R-E-F-U-L-L-Y !
Right there?
The strdup can either be in addKeyword or, as in the article, before the call to addKeyword:
addKeyword(&prev,
strdup("image/jpeg"),
EXTRACTOR_MIMETYPE);
I'm also not aware of any strdup's missing (at the moment) in the actual source, so I'm not sure what your comment refers to. :-)