Reading File Metadata with extract and libextractor
Modern file formats have provisions to annotate the contents of the file with descriptive information. This development is driven by the need to find a better way to organize data than merely by using filenames. The problem with such metadata is that it is not stored in a standardized manner across different file formats. This makes it difficult for format-agnostic tools, such as file managers or file-sharing applications, to make use of the information. It also results in a plethora of format-specific tools used to extract the metadata, such as AVInfo, id3edit, jpeginfo and Vocoditor.
In this article, the libextractor library and the extract tool are introduced. The goal of the libextractor Project is to provide a uniform interface for obtaining metadata from different file formats. libextractor currently is used by evidence, the file manager for the forthcoming version of Enlightenment, as well as by GNUnet, an anonymous, censorship-resistant peer-to-peer file-sharing system. The extract tool is a command-line interface to the library. libextractor is licensed under the GNU General Public License.
libextractor shares some similarities with the popular file tool, which uses the first bytes in a file to guess the MIME type. libextractor differs from file in that it tries to obtain much more information than the MIME type. Depending on the file format, libextractor can obtain additional information, including the name of the software used to create the file, the author, descriptions, album titles, image dimensions or the duration of a movie.
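To make the comparison concrete, here is a minimal sketch of the kind of leading-bytes check that file-style MIME detection relies on. The names magic_entry, MAGIC_TABLE and guess_mime are hypothetical and are not taken from file or libextractor; the table holds only a handful of well-known signatures, whereas real tools know many more.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One signature: the leading bytes that identify a format.
 * All names in this sketch are illustrative, not a real API. */
struct magic_entry {
    const char *bytes;   /* signature expected at offset 0 */
    size_t      length;  /* number of signature bytes */
    const char *mime;    /* MIME type reported on a match */
};

/* A tiny table of well-known signatures. */
static const struct magic_entry MAGIC_TABLE[] = {
    { "\x89PNG\r\n\x1a\n", 8, "image/png" },
    { "\xff\xd8\xff",      3, "image/jpeg" },
    { "GIF87a",            6, "image/gif" },
    { "GIF89a",            6, "image/gif" },
    { "%PDF-",             5, "application/pdf" },
    { "PK\x03\x04",        4, "application/zip" },
};

/* Return the MIME type whose signature matches the start of data,
 * or NULL if no signature in the table matches. */
const char *guess_mime(const unsigned char *data, size_t size)
{
    size_t i;

    assert(data != NULL || size == 0);
    for (i = 0; i < sizeof MAGIC_TABLE / sizeof MAGIC_TABLE[0]; i++) {
        const struct magic_entry *e = &MAGIC_TABLE[i];
        if (size >= e->length && memcmp(data, e->bytes, e->length) == 0)
            return e->mime;
    }
    return NULL;
}
```

A check like this stops at the MIME type. Obtaining authors, titles or image dimensions requires parsing the format itself, which is where the format-specific code comes in.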
libextractor obtains this information by using specific parser code for many popular formats. The list currently includes MP3, Ogg, Real Media, MPEG, RIFF (AVI), GIF, JPEG, PNG, TIFF, HTML, PDF, PostScript, Zip, OpenOffice.org, StarOffice, Microsoft Office, tar, DVI, man, Deb, ELF, RPM, ASF, as well as generic methods such as MIME-type detection. Many other formats exist, and among the more popular formats only a few proprietary formats are not supported.
Integrating support for new formats is easy, because libextractor uses plugins to gather data. libextractor plugins are shared libraries that typically provide code to parse one particular format. At the end of this article, we demonstrate how to integrate support for new formats into the library. libextractor gathers the metadata obtained from various plugins and provides clients with a list of pairs, consisting of a classification and a character sequence. The classification is used to organize the metadata into categories such as title, creator, subject and description.
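The list of pairs described above can be pictured as a singly linked list. The following is a minimal sketch under that assumption; the type and function names (keyword_list, keyword_add and so on) and the set of categories are hypothetical and do not reproduce the declarations in the libextractor headers. Each node owns a heap-allocated copy of its string, so that the whole list can be released in one pass.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical categories; the real library defines its own set. */
enum keyword_type {
    KW_UNKNOWN, KW_MIMETYPE, KW_TITLE, KW_CREATOR, KW_SUBJECT, KW_DESCRIPTION
};

/* One (classification, character sequence) pair. */
struct keyword_list {
    char *keyword;              /* heap-allocated, owned by the node */
    enum keyword_type type;
    struct keyword_list *next;
};

/* Return a heap-allocated copy of text, or NULL if out of memory. */
char *keyword_dup(const char *text)
{
    size_t n;
    char *copy;

    assert(text != NULL);
    n = strlen(text) + 1;
    copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, text, n);
    return copy;
}

/* Prepend a pair and return the new head. Takes ownership of keyword,
 * which must be heap-allocated. The list is unchanged on failure. */
struct keyword_list *keyword_add(struct keyword_list *head,
                                 char *keyword, enum keyword_type type)
{
    struct keyword_list *node;

    if (keyword == NULL)
        return head;
    node = malloc(sizeof *node);
    if (node == NULL) {
        free(keyword);
        return head;
    }
    node->keyword = keyword;
    node->type = type;
    node->next = head;
    return node;
}

/* Return the first keyword of the given type, or NULL if none exists. */
const char *keyword_find(const struct keyword_list *head,
                         enum keyword_type type)
{
    for (; head != NULL; head = head->next)
        if (head->type == type)
            return head->keyword;
    return NULL;
}

/* Release every node together with the string it owns. */
void keyword_free(struct keyword_list *head)
{
    while (head != NULL) {
        struct keyword_list *next = head->next;
        free(head->keyword);
        free(head);
        head = next;
    }
}
```

Because keyword_free releases every string, callers must pass copies rather than string literals, for example keyword_add(list, keyword_dup("image/jpeg"), KW_MIMETYPE).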
The simplest way to install libextractor is to use one of the binary packages available for many distributions. Under Debian, the extract tool is in a separate package, extract. Headers required to compile other applications against libextractor are contained in libextractor0-devel. If you want to compile libextractor from source, you need an unusual amount of memory: 256MB of system memory is roughly the minimum, as GCC uses about 200MB to compile one of the plugins. Otherwise, compiling by hand follows the usual sequence of steps, as shown in Listing 1.
Listing 1. Compiling libextractor requires about 200MB of memory.
$ wget http://ovmj.org/libextractor/download/libextractor-0.4.1.tar.gz
$ tar xvfz libextractor-0.4.1.tar.gz
$ cd libextractor-0.4.1
$ ./configure --prefix=/usr/local
$ make
# make install
After installing libextractor, the extract tool can be used to obtain metadata from documents. By default, the extract tool uses a canonical set of plugins, which consists of all file-format-specific plugins supported by the current version of libextractor, together with the mime-type detection plugin. Example output for the Linux Journal Web site is shown in Listing 2.
Listing 2. Extracting metadata from HTML.
$ wget -q http://www.linuxjournal.com/
$ extract index.html
description - The Monthly Magazine of the Linux Community
keywords - linux, linux journal, magazine
If you are a user of BibTeX, the option -b is likely to come in handy to create BibTeX entries automatically from documents that have been equipped properly with metadata, as shown in Listing 3.
Listing 3. Creating BibTeX entries can be trivial if the documents come with plenty of metadata.
$ wget -q http://www.copyright.gov/legislation/dmca.pdf
$ extract -b ~/dmca.pdf
% BiBTeX file
@misc{ unite2001the_d,
title = "The Digital Millennium Copyright Act
of 1998",
author = "United States Copyright Office - jmf",
note = "digital millennium copyright act
circumvention technological protection management
information online service provider liability
limitation computer maintenance competition
repair ephemeral recording webcasting distance
education study vessel hull",
year = "2001",
month = "10",
key = "Copyright Office Summary of the DMCA",
pages = "18"
}
Another interesting option is -B LANG. This option loads one of the language-specific but format-agnostic plugins. These plugins attempt to find plain text in a document by matching strings in the document against a dictionary. If the need for 200MB of memory to compile libextractor seems mysterious, the answer lies in these plugins. In order to perform a fast dictionary search, a Bloom filter is created that allows fast probabilistic matching; GCC finds the resulting data structure a bit hard to swallow.
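For readers unfamiliar with the data structure, the following is a minimal sketch of probabilistic dictionary matching with a bit array and several hash functions. It is not the code used by the libextractor plugins: the bit-array size, the number of hashes and the use of a seeded FNV-1a hash are all assumptions chosen for illustration, and the names bloom, bloom_add and bloom_might_contain are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS   8192u  /* size of the bit array (assumption) */
#define BLOOM_HASHES 3u     /* hash functions per word (assumption) */

struct bloom {
    unsigned char bits[BLOOM_BITS / 8];
};

/* Seeded FNV-1a; different seeds stand in for independent hashes. */
static uint32_t bloom_hash(const char *word, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;

    while (*word != '\0') {
        h ^= (unsigned char)*word++;
        h *= 16777619u;
    }
    return h;
}

/* Start with an empty filter: every bit cleared. */
void bloom_init(struct bloom *b)
{
    assert(b != NULL);
    memset(b->bits, 0, sizeof b->bits);
}

/* Set one bit per hash function for the given dictionary word. */
void bloom_add(struct bloom *b, const char *word)
{
    uint32_t i;

    for (i = 0; i < BLOOM_HASHES; i++) {
        uint32_t bit = bloom_hash(word, i * 0x9e3779b9u) % BLOOM_BITS;
        b->bits[bit / 8] |= (unsigned char)(1u << (bit % 8));
    }
}

/* Return 0 if the word is definitely absent, 1 if it may be present.
 * False positives are possible; false negatives are not. */
int bloom_might_contain(const struct bloom *b, const char *word)
{
    uint32_t i;

    for (i = 0; i < BLOOM_HASHES; i++) {
        uint32_t bit = bloom_hash(word, i * 0x9e3779b9u) % BLOOM_BITS;
        if ((b->bits[bit / 8] & (1u << (bit % 8))) == 0)
            return 0;
    }
    return 1;
}
```

In this sketch the filter is filled at run time. A filter for a whole dictionary that is instead compiled in as a large static array would plausibly account for the compiler's memory use mentioned above.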
The option -B is useful for formats that currently are undocumented or unsupported. The printable plugins typically print the entire text of the document in order. Listing 4 shows the output of extract run on a Microsoft Word document.
Comments
Test this tool online
Online metadata reader is using libextractor. Might be handy if someone wants to test the results first without installing libextractor.
Can this extract index
Can this extract index information of PDF files?
Extracting titles from word documents on linux
WORD DOCS ARE SUPPORTED
A quick scan of the examples on this page initially made it seem to me as if the extract program does not support Microsoft Word Documents.
Closer inspection reveals that extracting metadata from Office documents is supported.
[foo@localhost ~]$ extract foo.doc
mimetype - application/vnd.ms-files
os - Win32
organization - Foo Publishing
page count - 1
modification date - Tue Sep 6 16:10:00 2005
software - Microsoft Office Word
version - 3
format - ABC123
keywords - SCADA, Cryptographic Protection, Communications
author - ABC123 Task Group
subject - Cryptographic Protection of SCADA Communications
title - ABC123 Draft 3
[foo@localhost ~]$
Missing strdup()?
"The strdup in the code is important, because the string will be deallocated later, typically in EXTRACTOR_freeKeywords()."
If that strdup() is so important, then where is it? ;)
strdup necessary
The strdup() referred to is in Listing 8 !!
R-E-A-D M-O-R-E C-A-R-E-F-U-L-L-Y !
Right there?
The strdup can either be in addKeyword or, as in the article, before the call to addKeyword:
addKeyword(&prev,
strdup("image/jpeg"),
EXTRACTOR_MIMETYPE);
I'm also not aware of any strdup's missing (at the moment) in the actual source, so I'm not sure what your comment refers to. :-)