Searching PDF Files With grep

Comments

Why not use strings for the same effect with more portability?

ojblass

for a in *.pdf; do
    strings "$a" | grep Copyleft
done

Because It Doesn't Work

Mitch Frazier

I'm not sure if it never works, but it doesn't always work. PDF files contain compressed data, and the characters of a word may not be right next to each other in the data. With strings I can't find things that I can find with pdftotext:

  $ strings datasheet.pdf | grep Charac
  #### nothing found ####

  $ pdftotext datasheet.pdf - | grep Charac
  ... Part 27pF Tech. Characteristics 50V-20% Ceramic Package CASE 0603

Mitch Frazier is an Associate Editor for Linux Journal.
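
To illustrate the compression point above: PDF content streams are commonly stored with Flate (deflate) compression, the same algorithm gzip uses. The sketch below does not touch a real PDF; it uses gzip as a stand-in, and the helper names sample_text and count_hits are made up for the illustration. It runs strings over the same text before and after compression.

```shell
#!/bin/sh
# Illustration only: gzip stands in for the Flate compression used inside PDFs.

# Print 100 lines of sample text (phrase borrowed from the datasheet example).
sample_text() {
    i=0
    while [ "$i" -lt 100 ]; do
        echo 'Tech. Characteristics 50V-20% Ceramic'
        i=$((i + 1))
    done
}

# Count the printable strings on stdin that contain "Charac".
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_hits() {
    strings | grep -c 'Charac' || true
}

echo "uncompressed: $(sample_text | count_hits)"
echo "compressed:   $(sample_text | gzip -c | count_hits)"
```

The uncompressed count should equal the number of sample lines, while the compressed stream should yield no hits. That mirrors what strings sees inside a compressed PDF, and it is why a tool that actually decodes the file, such as pdftotext, is needed.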

CentOS (Red Hat) uses "poppler-utils"

Joe Mama

Installing with YUM pulls in both "poppler" and "poppler-utils".

Thanks for this nice tip!

great tip thanks

turgut

That was very useful. On Fedora I found xpdf, but no poppler and no xpdf-tools. And yes, xpdf has pdftotext in it. Thanks! -t

SuSE PDF Package

Anonymous

When I search rpmfind.net, poppler-tools only provides rendering libraries (which require poppler, based on the xpdf-3.0 code base). The utilities themselves are not provided in any poppler-tools RPM I can find. On my OpenSuSE 11 system, pdftotext is provided by xpdf-tools.

From http://poppler.freedesktop.org:
"Poppler is a PDF rendering library based on the xpdf-3.0 code base."

Check OpenSuSE's site

Mitch Frazier

You can find the RPMs on opensuse.org:

http://download.opensuse.org/repositories/openSUSE:/11.0/standard/i586/poppler-tools-0.8.2-1.1.i586.rpm
http://download.opensuse.org/repositories/openSUSE:/11.0/standard/x86_64/poppler-tools-0.8.2-1.1.x86_64.rpm

I also use OpenSuSE 11.0:

$ cat /etc/SuSE-release
openSUSE 11.0 (X86-64)
VERSION = 11.0

$ rpm -q -l poppler-tools
/usr/bin/pdffonts
/usr/bin/pdfimages
/usr/bin/pdfinfo
/usr/bin/pdftoabw
/usr/bin/pdftohtml
/usr/bin/pdftoppm
/usr/bin/pdftops
/usr/bin/pdftotext
/usr/share/man/man1/pdffonts.1.gz
/usr/share/man/man1/pdfimages.1.gz
/usr/share/man/man1/pdfinfo.1.gz
/usr/share/man/man1/pdftohtml.1.gz
/usr/share/man/man1/pdftoppm.1.gz
/usr/share/man/man1/pdftops.1.gz
/usr/share/man/man1/pdftotext.1.gz

The poppler-tools RPM

Anonymous

The poppler-tools RPM doesn't appear to be standard yet (see below), so it may not be available on all distros, which is probably why it's not on rpmfind.net. I have SLED11, based on OpenSuSE 11, but there is no poppler-tools RPM for it. I was not able to check Fedora, Red Hat, or Debian-based distros (e.g. Ubuntu, Knoppix).

However, like I said, all of the contents of poppler-tools are included in xpdf-tools (with the exception of pdftohtml and pdftoabw). xpdf-tools is available from rpmfind.net for a few distros and appears to be more readily available for the more common ones.

The "rpm --whatprovides..." query in your tip still applies regardless of which RPM supplies the tool; I mention xpdf-tools to help users who may be confused when --whatprovides returns something other than poppler-tools. The functionality of the pdfto[whatever] tools appears to be similar, though.
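
For anyone unsure which package supplies pdftotext on their own system, a rough sketch along these lines may help. The function name which_package is made up here, and it queries by file with rpm -qf (or dpkg -S on Debian-style systems) rather than reproducing the exact query from the tip.

```shell
#!/bin/sh
# Hypothetical helper: report which installed package owns a given tool.
# Usage: which_package pdftotext
which_package() {
    tool=$1
    # Resolve the tool to a path; give up if it is not installed at all.
    path=$(command -v "$tool") || {
        echo "$tool: not found in PATH" >&2
        return 1
    }
    if command -v rpm >/dev/null 2>&1; then
        # RPM-based distros: ask which package owns the file.
        rpm -qf "$path"
    elif command -v dpkg >/dev/null 2>&1; then
        # Debian-based distros: search the installed file lists.
        dpkg -S "$path"
    else
        echo "$path (no rpm or dpkg available to query the owner)"
    fi
}
```

Running which_package pdftotext should name poppler-tools, poppler-utils, xpdf-tools, or whatever else your distro uses, which is the point of the comment above: the package name varies, the tool does not.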

Result of rpm -qp poppler-tools-0.8.2-1.1.i586.rpm --info:

Poppler is a fork of the xpdf PDF viewer developed by Derek Noonburg of
Glyph and Cog, LLC. The purpose of forking xpdf is twofold. First, to
provide PDF rendering functionality as a shared library to centralize
the maintenance effort. Today a number of applications incorporate the
xpdf code base and whenever a security issue is discovered, all these
applications exchange patches and put out new releases. In turn, all
distributions must package and release new versions of these xpdf based
viewers. It is safe to say that there is a lot of duplicated effort
with the current situation. Even if poppler in the short term
introduces yet another xpdf-derived code base to the world, it is hoped
that over time these applications will adopt poppler. After all, we
only need one application to use poppler to break even.

Second, we would like to move libpoppler forward in a number of areas
that do not fit within the goals of xpdf. By design, xpdf depends on
very few libraries and runs on a wide range of X-based platforms. This
is a strong feature and reasonable design goal. However, poppler
intends to replace parts of xpdf that are now available as standard
components of modern Unix desktop environments. One such example is
fontconfig, which solves the problem of matching and locating fonts on
the system in a standardized and well understood way. Another example
is cairo, which provides high quality 2D rendering. See the file TODO
for a list of planned changes.

So, as with most things in the whacky world of open-source, it would appear poppler-tools is trying to move away from xpdf-based code and rely solely on libpoppler. This basically gives users a choice of which PDF library implementation to use.

a couple of quick scripts

quaid

Thanks for the tip, that's a great and elegant solution.

Going one step further, I threw together two scripts: one for grepping a single PDF, the other for grepping a directory of PDFs.

http://iquaid.org/programs/grep-pdf
http://iquaid.org/programs/grep-pdf-multi

Typical bash script, more copyright than code. Let me know if you make any improvements, I'd like to use them, too. :)
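
The linked scripts are not reproduced here. As a minimal sketch of the same idea, assuming pdftotext is on the PATH, something like the following would cover both the single-file and the many-files case. The function name grep_pdfs is hypothetical and is not taken from the scripts above.

```shell
#!/bin/sh
# Hypothetical helper, not the grep-pdf scripts linked above.
# Usage: grep_pdfs PATTERN FILE.pdf [FILE.pdf ...]
grep_pdfs() {
    pattern=$1
    shift
    for pdf in "$@"; do
        # "-" tells pdftotext to write the extracted text to standard output.
        pdftotext "$pdf" - | grep -- "$pattern" |
        while IFS= read -r line; do
            # Prefix each matching line with the file it came from.
            printf '%s: %s\n' "$pdf" "$line"
        done
    done
}
```

For a directory of PDFs, call it as grep_pdfs Copyleft *.pdf; for a single file, pass just that file.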
