Converting e-Books to Open Formats
Books in digital format, also known as e-books, can be read on devices lacking the power and screen space to afford a regular Web browser. Several publishers, not to mention projects such as Project Gutenberg, have provided thousands of new and classic titles in digital format. The problem is both the hardware—be it generic PDAs or dedicated devices—and the whole e-book publishing industry are much more fragmented than are PCs and Web browsers. Therefore, it is probable that the e-book you recently bought will not be readable ten years from now—nor tomorrow, should you decide to use a laptop or change PDAs. To help combat this fragmentation, this article discusses some existing command-line tools that can convert the most popular e-book formats to ASCII or HTML.
Practically no tools exist now to export e-book formats to PDF or OpenDocument, the new OASIS standard used in OpenOffice.org, but this is not necessarily a big deal. Once text is in ASCII or HTML format, it easily can be moved to plain-text or PDF format by using a text browser such as w3m or programs such as html2ps. If you go this route for conversion, you are able to do it today, and because it's an open format, 20 years from now too.
On PalmOS, the original and most common e-book format is PalmDoc, also called AportisDoc or simply Doc, even though it has nothing to do with Microsoft Word's .doc format. Doc, recognizable by the extensions .pdb (Palm Database) or .prc (Palm Resource Code), basically is a PalmPilot database composed of records strung together. This standard has spun off several variants, including MobiPocket, which adds embedded HTML markup tags to the basic format.
Each Palm e-book is divided into three sections: the header, a series of text records and a series of bookmark records. Normally, the header is 16 bytes wide. Some Doc readers may extend the width at run time to hold additional custom information. By default, the header contains data such as the total length of the uncompressed text, the position currently viewed in the document and an array of two-byte unsigned integers giving the uncompressed size of each text record. Usually, the maximum size for this kind of records is 4,096 bytes, and each one of them is compressed individually.
The bookmark records are composed of a 16-byte name and a 4-byte offset from the beginning of text. Because bookmarks are optional, many Doc e-books don't contain them, and most Doc readers support alternative—that is, non-portable—methods to specify them. Other reader-specific extensions might include category, version numbers and links between e-books. Almost always, this information is stored outside the .pdb or .rc file. Therefore, you should not expect to preserve this kind of data when converting your e-books.
Pyrite Publisher, formerly Doc Toolkit, is a set of content conversion tools for the Palm platform. Currently, only some text formats can be converted, but functionality can be extended to support new ones by way of Python plugins. Pyrite Publisher can download the documents to convert directly from the Web; it also can download set bookmarks directly to the output database. The package, which requires Python 2.1 or greater, can be used from the command line or through a wxWindows-based GUI. The software is available for Linux and Windows in both source and binary format. Should you choose the latter option, remember that compiled versions expect Python to be in /usr. The Linux version can install converted files straight to the PDA using JPilot or pilot-link.
Pyrite installed and ran flawlessly on Fedora Core 2. Unlike the other command-line converters presented below, however, Pyrite can save only in ASCII format, not in HTML. The name of the executable is pyrpub. The exact command for converting .pdb files uses this syntax:
pyrpub -P TextOutput -o don_quixote.txt \ Don_Quixote.pdb
Pyrite can be enough if all you want to do is quickly index a digital library. On the other hand, it is almost trivial to reformat the result to make it more readable in a browser. The snippet of Perl code in Listing 1, albeit ugly, was all it took to produce the version of Don Quixote shown in Figure 1.
Listing 1. A simple Perl script converts Pyrite's extracted text to HTML.
#! /usr/bin/perl undef $/; $TEXT = <>; $TEXT =~ s/\n\n/<p>/gm; print <<END_HTML; <html><body> $TEXT </body></html> END_HTML
The script loads the whole ASCII text previously generated with Publisher, and every time it finds two new lines in a row, it replaces them with HTML paragraph markers. The result then is printed to standard output and properly formatted as basic HTML. To change justification, fonts and colors, you simply need to paste your favourite stylesheet right after the <html><body> line.
OpenOffice.org 2.0, expected to be released in spring 2005, will be able to save text in .pdb format. If it also is able to read such files, its mass conversion feature (File→AutoPilot→Document Converter) would solve the problem nicely. I have tried to do this with the 1.9.m65 preview, but all I got was a General input/output error pop-up message. Hopefully, this functionality will be added to future versions.
Articles about Digital Rights and more at http://stop.zona-m.net CV, talks and bio at http://mfioretti.com
- Smoothwall Express
- Machine Learning Everywhere
- Bash Shell Script: Building a Better March Madness Bracket
- Own Your DNS Data
- Simple Server Hardening
- From vs. to + for Microsoft and Linux
- Understanding Firewalld in Multi-Zone Configurations
- The Weather Outside Is Frightful (Or Is It?)
- Understanding OpenStack's Success
- Ensono M.O.