Converting e-Books to Open Formats
Books in digital format, also known as e-books, can be read on devices lacking the power and screen space to afford a regular Web browser. Several publishers, not to mention projects such as Project Gutenberg, have provided thousands of new and classic titles in digital format. The problem is both the hardware—be it generic PDAs or dedicated devices—and the whole e-book publishing industry are much more fragmented than are PCs and Web browsers. Therefore, it is probable that the e-book you recently bought will not be readable ten years from now—nor tomorrow, should you decide to use a laptop or change PDAs. To help combat this fragmentation, this article discusses some existing command-line tools that can convert the most popular e-book formats to ASCII or HTML.
Practically no tools exist now to export e-book formats to PDF or OpenDocument, the new OASIS standard used in OpenOffice.org, but this is not necessarily a big deal. Once text is in ASCII or HTML format, it easily can be moved to plain-text or PDF format by using a text browser such as w3m or programs such as html2ps. If you go this route for conversion, you are able to do it today, and because it's an open format, 20 years from now too.
On PalmOS, the original and most common e-book format is PalmDoc, also called AportisDoc or simply Doc, even though it has nothing to do with Microsoft Word's .doc format. Doc, recognizable by the extensions .pdb (Palm Database) or .prc (Palm Resource Code), basically is a PalmPilot database composed of records strung together. This standard has spun off several variants, including MobiPocket, which adds embedded HTML markup tags to the basic format.
Each Palm e-book is divided into three sections: the header, a series of text records and a series of bookmark records. Normally, the header is 16 bytes wide. Some Doc readers may extend the width at run time to hold additional custom information. By default, the header contains data such as the total length of the uncompressed text, the position currently viewed in the document and an array of two-byte unsigned integers giving the uncompressed size of each text record. Usually, the maximum size for this kind of records is 4,096 bytes, and each one of them is compressed individually.
The bookmark records are composed of a 16-byte name and a 4-byte offset from the beginning of text. Because bookmarks are optional, many Doc e-books don't contain them, and most Doc readers support alternative—that is, non-portable—methods to specify them. Other reader-specific extensions might include category, version numbers and links between e-books. Almost always, this information is stored outside the .pdb or .rc file. Therefore, you should not expect to preserve this kind of data when converting your e-books.
Pyrite Publisher, formerly Doc Toolkit, is a set of content conversion tools for the Palm platform. Currently, only some text formats can be converted, but functionality can be extended to support new ones by way of Python plugins. Pyrite Publisher can download the documents to convert directly from the Web; it also can download set bookmarks directly to the output database. The package, which requires Python 2.1 or greater, can be used from the command line or through a wxWindows-based GUI. The software is available for Linux and Windows in both source and binary format. Should you choose the latter option, remember that compiled versions expect Python to be in /usr. The Linux version can install converted files straight to the PDA using JPilot or pilot-link.
Pyrite installed and ran flawlessly on Fedora Core 2. Unlike the other command-line converters presented below, however, Pyrite can save only in ASCII format, not in HTML. The name of the executable is pyrpub. The exact command for converting .pdb files uses this syntax:
pyrpub -P TextOutput -o don_quixote.txt \ Don_Quixote.pdb
Pyrite can be enough if all you want to do is quickly index a digital library. On the other hand, it is almost trivial to reformat the result to make it more readable in a browser. The snippet of Perl code in Listing 1, albeit ugly, was all it took to produce the version of Don Quixote shown in Figure 1.
Listing 1. A simple Perl script converts Pyrite's extracted text to HTML.
#! /usr/bin/perl undef $/; $TEXT = <>; $TEXT =~ s/\n\n/<p>/gm; print <<END_HTML; <html><body> $TEXT </body></html> END_HTML
The script loads the whole ASCII text previously generated with Publisher, and every time it finds two new lines in a row, it replaces them with HTML paragraph markers. The result then is printed to standard output and properly formatted as basic HTML. To change justification, fonts and colors, you simply need to paste your favourite stylesheet right after the <html><body> line.
OpenOffice.org 2.0, expected to be released in spring 2005, will be able to save text in .pdb format. If it also is able to read such files, its mass conversion feature (File→AutoPilot→Document Converter) would solve the problem nicely. I have tried to do this with the 1.9.m65 preview, but all I got was a General input/output error pop-up message. Hopefully, this functionality will be added to future versions.
Articles about Digital Rights and more at http://stop.zona-m.net CV, talks and bio at http://mfioretti.com
Fast/Flexible Linux OS Recovery
On Demand Now
In this live one-hour webinar, learn how to enhance your existing backup strategies for complete disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible full-system recovery solution for UNIX and Linux systems.
Join Linux Journal's Shawn Powers and David Huffman, President/CEO, Storix, Inc.
Free to Linux Journal readers.Register Now!
- Sony Settles in Linux Battle
- Download "Linux Management with Red Hat Satellite: Measuring Business Impact and ROI"
- Libarchive Security Flaw Discovered
- Profiles and RC Files
- Maru OS Brings Debian to Your Phone
- Understanding Ceph and Its Place in the Market
- Snappy Moves to New Platforms
- The Giant Zero, Part 0.x
- Git 2.9 Released
- Astronomy for KDE
With all the industry talk about the benefits of Linux on Power and all the performance advantages offered by its open architecture, you may be considering a move in that direction. If you are thinking about analytics, big data and cloud computing, you would be right to evaluate Power. The idea of using commodity x86 hardware and replacing it every three years is an outdated cost model. It doesn’t consider the total cost of ownership, and it doesn’t consider the advantage of real processing power, high-availability and multithreading like a demon.
This ebook takes a look at some of the practical applications of the Linux on Power platform and ways you might bring all the performance power of this open architecture to bear for your organization. There are no smoke and mirrors here—just hard, cold, empirical evidence provided by independent sources. I also consider some innovative ways Linux on Power will be used in the future.Get the Guide