PdfMasher--E-Book Conversion
If you've had problems reading PDF files on various devices (like mobile phones), PdfMasher may be just what you're looking for. According to the Web site:
PdfMasher is a tool to convert PDF files containing text in ready-for-e-book HTML files. Most e-book readers support PDF files natively, but it's often a real pain to read those documents, because we don't have font-size control over the document like we have with native e-books. In many cases, we have to use the zooming feature, and it's just a pain. Another drawback of PDFs on e-book readers is that annotations are not supported.
There are already tools to convert PDFs to e-books, like Calibre, but what they do is try to guess the role of each piece of text in the PDF (and that's if you're lucky). I think that in all but the simplest cases, it's a mistake to think that anything short of an AI can do that kind of guessing.
With PdfMasher, you can manually prepare PDF files like these for conversion into other formats.
With the original PDF on the left and outputted HTML on the right, this e-book now can be read on any device without readability woes.
Installation
If you can install this with a binary, by all means do so. Available on the site are 32- and 64-bit Linux .deb packages for the ubiquitous Intel x86 architecture. For masochists, or those who don't have an Intel-based CPU, there is the obligatory source.
In order to grab the latest source, first you need to install hg, which was under the package name "mercurial" on my Kubuntu system. Once that's installed, grab the latest source by entering the command:
$ hg clone https://bitbucket.org/hsoft/pdfmasher
Once that has finished downloading, keep this terminal open where it is, because next you'll need to sort out the library requirements, and then you'll return to this terminal and continue the installation. As far as dependencies are concerned, the documentation lists the following:
- Python 3.2: http://www.python.org
- pdfminer3k: http://hg.hardcoded.net/pdfminer3k
- jobprogress 1.0.0: http://hg.hardcoded.net/jobprogress
- Sphinx 1.0.7: http://sphinx.pocoo.org
- pytest 2.0.3 (to run unit tests): http://pytest.org
- Markdown 2.0.3: http://www.freewisdom.org/projects/python-markdown
- PyQt 4.7.5: http://www.riverbankcomputing.co.uk/news
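Because the build scripts depend on both the interpreter version and these libraries, a quick check before building can save some guesswork. The following is a minimal sketch, not part of PdfMasher. The import names in ASSUMED_MODULES are assumptions (a package's import name can differ from the name it is listed under), and the script itself needs a Python new enough to provide importlib.util.find_spec (3.4 or later).

```python
import importlib.util
import sys

# Listed dependency name -> import name. The import names are assumptions
# made for this sketch; adjust them if your installation differs.
ASSUMED_MODULES = {
    "pdfminer3k": "pdfminer",
    "jobprogress": "jobprogress",
    "Sphinx": "sphinx",
    "pytest": "pytest",
    "Markdown": "markdown",
    "PyQt": "PyQt4",
}


def python_is_new_enough(version_info=sys.version_info, minimum=(3, 2)):
    """Return True if the interpreter is at least the listed Python version."""
    return tuple(version_info[:2]) >= minimum


def missing_modules(modules=ASSUMED_MODULES):
    """Return the listed names whose assumed import name cannot be found."""
    return sorted(
        name
        for name, import_name in modules.items()
        if importlib.util.find_spec(import_name) is None
    )


if __name__ == "__main__":
    if not python_is_new_enough():
        print("Python 3.2 or later is needed; this is", sys.version.split()[0])
    for name in missing_modules():
        print("missing:", name)
```

Run it with the same interpreter you intend to use for the build scripts, because a system can have several Python versions installed side by side.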
With the dependencies out of the way, re-open the terminal from before and enter the following commands:
$ cd pdfmasher
$ python configure.py
$ python build.py
Then, run the program with:
$ python run.py
If you're lucky enough to have the binary installed, you simply can run the program with the command:
$ pdfmasher
Usage
Before I try to explain how to use PdfMasher myself, I should include the following from the Web site:
PdfMasher asks the user about the role of each piece of text, and does it in an efficient manner. Your PDF has a header on each page, and you don't want them to litter your text? Sort text elements by Y-position (thus grouping them all together); Shift-select the elements and flag them as ignored. They will not appear on your final HTML. Your PDF has footnotes on many pages? Sort your elements by text content (thus grouping all elements with the text starting with a number together) and flag them as footnotes. They will be moved to the end of the document, and PdfMasher will try to create hyperlinks to footnote references.
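The sort-and-flag idea in that description can be sketched in a few lines of Python. This is an illustration only, not PdfMasher's own code: the TextElement class, its field names and the state strings are hypothetical, chosen to mirror what the description mentions.

```python
from dataclasses import dataclass


@dataclass
class TextElement:
    """Hypothetical stand-in for one piece of text extracted from a PDF."""
    page: int
    x: float
    y: float
    font_size: float
    text: str
    state: str = "normal"  # "normal", "ignored", "title" or "footnote"


def sort_by_y(elements):
    """Group elements by Y-position, as sorting the Y column does by eye."""
    return sorted(elements, key=lambda el: (el.y, el.page))


def flag_ignored_at_y(elements, y, tolerance=1.0):
    """Flag every element sitting at roughly the given Y-position as ignored."""
    for el in elements:
        if abs(el.y - y) <= tolerance:
            el.state = "ignored"


def flag_footnotes(elements):
    """Flag still-normal elements whose text starts with a digit as footnotes."""
    for el in elements:
        if el.state == "normal" and el.text[:1].isdigit():
            el.state = "footnote"


elements = [
    TextElement(1, 72, 760, 9, "Journal of Examples"),
    TextElement(1, 72, 700, 11, "First paragraph of body text."),
    TextElement(1, 72, 60, 8, "1 A footnote on page one."),
    TextElement(2, 72, 760, 9, "Journal of Examples"),
    TextElement(2, 72, 700, 11, "Body text continues."),
]

flag_ignored_at_y(elements, y=760)  # the repeated page header
flag_footnotes(elements)
states = [el.state for el in elements]
```

In the program, sorting brings matching lines together so they can be Shift-selected by hand. In code, a simple filter on the same field does the selecting, which is why sort_by_y is only needed for inspection here.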
Before changing things in PdfMasher, I recommend having your PDF open to one side in another program so you can cross-check bits of text as you're culling sections. When you're ready to start, click on Open File and choose the PDF you want to "mash".
Once the file is open, the pane below fills up in a manner that at first glance is overwhelming and incomprehensible. However, on a very basic level, each line is a section of text in your PDF. If you explore each line, you can check which part of the PDF it represents, and if that part is redundant, you can choose to ignore it in the conversion.
Looking at these PdfMasher lines in detail, each line has an X and Y axis reference, as well as font size, text length and page number. Whenever you click a line, the full text content of its section in the PDF is shown in the pane below.
If you've decided which sections to remove, click Ignore to cut the text out of the final product. Click Normal to reinstate the text for inclusion. Depending on which device you'll be reading the resulting e-book on, the header and footer information may be something you want to cut out of the page.
For example, in the screenshot, I'm removing the beginning references and page headers in a psychology paper that otherwise would leave a hard-to-navigate, garbled mess if I translated it into something I could read on my phone.
However, if what you're preparing is intended to be something like a public Web page instead of a trimmed-down e-book, you might want to use the Title and Footnote buttons. Title will result in an H1 title header in the outputted HTML. The Footnote button will move the text to the bottom of the document, and PdfMasher will try to make one of the cool hyperlinks mentioned earlier.
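To make the effect of those flags concrete, here is a rough sketch of how flagged text could end up as HTML. It is not PdfMasher's implementation: it goes straight to HTML rather than through an intermediate file, the (state, text) pairs are a hypothetical representation, and it assumes that footnotes begin with their number and that references appear in the body as "[N]".

```python
import html
import re


def render_html(elements):
    """Render (state, text) pairs as simple HTML.

    Titles become <h1>, normal text becomes <p>, ignored text is dropped,
    and footnotes move to the end with anchors that references link to.
    """
    # First pass: collect footnotes keyed by their leading number.
    notes = {}
    unnumbered = []
    for state, text in elements:
        if state != "footnote":
            continue
        match = re.match(r"\s*(\d+)[.)]?\s*(.*)", text, re.S)
        if match:
            notes[match.group(1)] = match.group(2)
        else:
            unnumbered.append(text)

    def link_refs(escaped):
        # Turn "[N]" into a link only when footnote N was actually collected.
        def repl(m):
            n = m.group(1)
            if n in notes:
                return '<a href="#fn-%s">[%s]</a>' % (n, n)
            return m.group(0)
        return re.sub(r"\[(\d+)\]", repl, escaped)

    # Second pass: emit the body, then the footnotes at the bottom.
    out = []
    for state, text in elements:
        if state == "title":
            out.append("<h1>%s</h1>" % html.escape(text))
        elif state == "normal":
            out.append("<p>%s</p>" % link_refs(html.escape(text)))
    for n, note in notes.items():
        out.append('<p id="fn-%s">[%s] %s</p>' % (n, n, html.escape(note)))
    for text in unnumbered:
        out.append("<p>%s</p>" % html.escape(text))
    return "\n".join(out)


doc = [
    ("ignored", "Journal of Examples"),
    ("title", "Reading & Memory"),
    ("normal", "Recall improves with spacing [1]."),
    ("footnote", "1 See the appendix."),
]
page = render_html(doc)
```

Collecting the footnotes before rendering the body is what lets a reference become a link only when a matching footnote exists, so a stray "[7]" in the text is left alone.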
Once you've finished editing your document, click on the Build tab below, and then click on the Generate Markdown button. A raw text file will be generated in the same folder as the original PDF. Click on Reveal Markdown, and the source folder will be opened in your default file manager. Edit Markdown will open the actual text file in your default text editor, and View HTML will show the end product in a Web browser.
If you've made any errors, the output will reveal them quickly, and you can go back and simply start the Build process again. From here, you either can leave your output as is or convert your files into specific e-book formats.
Either way, PdfMasher uses some very simple methods to create something very clever and is a must-have for any regular e-book reader.
Learn More: http://www.hardcoded.net/pdfmasher
John Knight is the New Projects columnist for Linux Journal.
Comments
This is a great feature, and I'm sure a lot of people will be interested in using it. I think I'm going to install this on my phone too.
Asigurare
Cannot build pdfmasher on Ubuntu 10.04 after trying all sorts of packages
I've tried loading all sorts of packages from all sorts of sources and continue to fail at a pdfmasher build. I resorted to 'build' after I could not satisfy the DEB file dependencies.
Running python ./configure.py --ui=qt results in:
Traceback (most recent call last):
File "./configure.py", line 11, in <module>
from argparse import ArgumentParser
ImportError: No module named argparse
Another time the same command gave me:
File "./configure.py", line 15
if ui not in {'cocoa', 'qt'}:
^
SyntaxError: invalid syntax
So then I tried to use 'pip' to install things:
prompt$ pip install requirements-lnx.txt
Unknown or unsupported command 'install'
I'm new to Python development, but I've been a code slinger for years, so I suspect that I'm missing something very obvious.
~~~ 0;-Dan
Ubuntu 10.04 and Python 3.2
The .deb file available on the pdfmasher site has a dependency for >=python3.2. Ubuntu 10.04 does not have this in the repos. Anyone have a howto or know of a PPA with the dependencies to get this going on Ubuntu 10.04? The IRIE Shinsuke PPA has python 3.2 but not all the dependencies.
Thank you
Tools like this make computers computers again.
Now if we can just stop people making PDFs and other non-computer shaped documents...
PDFMasher is a good tool, but not a replacement of Calibre
hi John,
Please make it clear that this tool is not a replacement for Calibre. Maybe it's a better PDF converter, but it is definitely not an e-book catalog.
Thanks a lot for the introduction of this tool, it is very handy.
And why not e-mail Kovid Goyal, Calibre's author, and present this tool? Maybe there's a way to use them in the same interface. They are both Python devs.
Best,
Eugenio
The Calibre comments are the author's, not mine. You may wish to send your objections their way.
John Knight is the New Projects columnist for Linux Journal.