OpenOffice.org ODF, Python and XML

Combine Python with the open format of ODF files to manipulate fine details.

My wife is a writer, which today means she uses a word processing program. It's a sophisticated, powerful program—OpenOffice.org Writer—but occasionally it won't do something that she wants it to do. In this article, we take a look at the structure of OpenDocument Format (ODF) files and see how Python, with its XML libraries, can help. Figure 1 shows an example.

Figure 1. Converting Quotation Marks

It's not hard to convert quotation marks on a few paragraphs by hand—or even on a few pages, if I'm doing it only once. But having to repeat such manual operations on subsequent revisions becomes tedious, especially on a longer document, such as a poetry collection or novel. (We might have to repeat these operations after importing plain text from an e-mail message, for example.)

Fortunately, ODF is open, so we should be able to manipulate the file contents outside the word processing program.

Let's see if we can do that manually, just to make sure we know what we're doing. Once we can do that, we'll create a script to do some more ambitious things with the document.

Cracking the OpenDocument Format—A Simple Example

I read somewhere that an ODF file is a zip archive of XML files. So, let's see if it really is one—and if so, what's inside:

% unzip -l ex1.odt
Archive:  ex1.odt
  Length     Date   Time    Name
 --------    ----   ----    ----
       39  11-15-06 01:55   mimetype
        0  11-15-06 01:55   Configurations2/statusbar/
        0  11-15-06 01:55   Configurations2/accelerator/current.xml
        0  11-15-06 01:55   Configurations2/floater/
        0  11-15-06 01:55   Configurations2/popupmenu/
        0  11-15-06 01:55   Configurations2/progressbar/
        0  11-15-06 01:55   Configurations2/menubar/
        0  11-15-06 01:55   Configurations2/toolbar/
        0  11-15-06 01:55   Configurations2/images/Bitmaps/
        0  11-15-06 01:55   Pictures/
     2872  11-15-06 01:55   content.xml
     9786  11-15-06 01:55   styles.xml
     1109  11-15-06 01:55   meta.xml
      878  11-15-06 01:55   Thumbnails/thumbnail.png
     6611  11-15-06 01:55   settings.xml
     2037  11-15-06 01:55   META-INF/manifest.xml
 --------                   -------
    23332                   16 files
%

Good news—it is a zip archive.
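Since Python is where we're headed, the same check can be sketched with the standard zipfile module. This is only an illustration: the function name list_odf_members is made up here, and it assumes nothing about the file beyond its being a zip archive.

```python
import zipfile


def list_odf_members(path):
    """Return (size, name) pairs for an ODF package, in archive order.

    `path` may be a filename or a seekable file-like object.
    Raises ValueError if the file is not a zip archive.
    """
    if not zipfile.is_zipfile(path):
        raise ValueError("not a zip archive")
    with zipfile.ZipFile(path) as odf:
        # infolist() keeps the order in which entries appear in the archive.
        return [(info.file_size, info.filename) for info in odf.infolist()]
```

Calling list_odf_members("ex1.odt") should give back the same lengths and names that unzip -l printed, in the same order.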

So, the plan is this: unpack it, modify a file (or files) and pack everything back up again. We'll pack up files in the same order, because it can matter: the ODF packaging rules expect mimetype to be the first entry in the archive. So, we need to save the file list.

The listing from running unzip has that file list, along with some other stuff. Let's select only the lines that have filenames (in this case, the lines with a : followed by digits) and print only the filenames. A single command to sed does that:

% unzip -l ex1.odt | sed -n '/:[0-9][0-9]/s|^.*:.. *||p'
mimetype
Configurations2/statusbar/
Configurations2/accelerator/current.xml
Configurations2/floater/
Configurations2/popupmenu/
Configurations2/progressbar/
Configurations2/menubar/
Configurations2/toolbar/
Configurations2/images/Bitmaps/
Pictures/
content.xml
styles.xml
meta.xml
Thumbnails/thumbnail.png
settings.xml
META-INF/manifest.xml
%
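If the sed expression looks cryptic, here is roughly the same filter written in Python, as a sketch. The names HAS_TIME, UP_TO_NAME and names_from_listing are invented for this example. The two regular expressions are copied from the sed command: the first picks lines with a colon followed by two digits, the second deletes everything through the time column and the spaces after it.

```python
import re

# Same test as sed's /:[0-9][0-9]/ address: a colon followed by two digits.
# In this listing, that only occurs in the time column of a file entry.
HAS_TIME = re.compile(r":[0-9][0-9]")

# Same pattern as sed's s|^.*:.. *||: a greedy match through the last colon,
# the two characters after it, and any spaces that follow.
UP_TO_NAME = re.compile(r"^.*:.. *")


def names_from_listing(listing):
    """Extract the filenames from the text printed by `unzip -l`."""
    return [
        UP_TO_NAME.sub("", line, count=1)
        for line in listing.splitlines()
        if HAS_TIME.search(line)
    ]
```

Like the sed version, this would trip over a filename that itself contains a colon, which none of the names in ex1.odt do.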

Looks good. Let's save the list in a shell variable—we'll use F (for files):

% F=$(unzip -l ex1.odt | sed -n '/:[0-9][0-9]/s|^.*:.. *||p')

With that settled, the next question is, which file to modify? To find out, let's find the file or files containing the word quotes, which appeared in the document. We'll unpack ex1.odt into an empty directory and ask grep, remembering to check files in subdirectories as well:

% cd TMP
% unzip -q ~/oo/ex1.odt
% find . -type f | xargs grep -l quote
./content.xml
%
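The same search can be done without unpacking anything, by reading each member straight out of the archive. A minimal sketch, with members_containing as a made-up name; it does a plain byte search, so it is closer to grep -l with a fixed string than to a full regular-expression grep.

```python
import zipfile


def members_containing(path, word):
    """Return the names of archive members whose bytes contain `word`."""
    needle = word.encode("utf-8")
    with zipfile.ZipFile(path) as odf:
        return [
            name
            for name in odf.namelist()
            if not name.endswith("/")       # skip directory entries
            and needle in odf.read(name)    # read() returns the raw bytes
        ]
```

For the sample document, members_containing("ex1.odt", "quote") should name the same file that grep found.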

Okay, content.xml is it. Text editors provide one way to manipulate content.xml, so let's give that a try. The relevant part looked like Figure 2 in Emacs.

Figure 2. Editing XML in Emacs

The two occurrences of &quot; (partially highlighted in Figure 2) represent the straight quotation marks.

I changed the straight quotes to the appropriate curly or smart quotes (found on either side of the word nice), as shown in Figure 3. The changed areas are, again, partially highlighted.

Figure 3. Edited XML with Smart Quotes
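The hand edit can be approximated in a few lines of Python. This is a naive sketch, not a finished tool: it assumes that straight quotes in the text are stored as the entity &quot; (as they were in this content.xml), that they come in pairs within a single run of text, and that the file is decoded as UTF-8 before the function is called. The names QUOTED and curl_quotes are invented for the example.

```python
import re

LEFT, RIGHT = "\u201c", "\u201d"   # left and right double quotation marks

# A pair of &quot; entities with no tag boundary (< or >) between them.
QUOTED = re.compile(r"&quot;([^<>]*?)&quot;")


def curl_quotes(xml_text):
    """Replace paired &quot; entities with curly quotes (naive sketch)."""
    return QUOTED.sub(lambda m: LEFT + m.group(1) + RIGHT, xml_text)
```

An unpaired &quot; is left alone, and so is a pair that straddles a tag, which is the cautious choice for a first attempt.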

With that done, let's zip the files (the list saved in $F) to create ex2.odt, and see what OpenOffice.org Writer thinks about it:

% zip -q ~/oo/ex2.odt $F
% oowriter ~/oo/ex2.odt

Figure 4. Writer Recognizes the New Quotes

It worked (Figure 4)! The formerly straight quotes around the word straight are now curly quotes, and they're even curled in the right direction. So, to review what we've done so far:

  • Created a list of the files in ex1.odt (saving it in $F).

  • Unpacked ex1.odt.

  • Made a simple change, manually, in content.xml.

  • Created ex2.odt (using $F).

  • Validated ex2.odt using OpenOffice.org Writer.
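The steps above, apart from the final check in Writer, can be sketched as one Python function. The name rewrite_member and its parameters are hypothetical. The sketch copies every entry in its original order, as we did with $F, and stores mimetype uncompressed, which is what the ODF packaging rules ask for.

```python
import zipfile


def rewrite_member(src, dst, member, transform):
    """Copy the ODF package `src` to `dst`, keeping the entry order and
    passing one member through `transform` (a bytes -> bytes function)."""
    with zipfile.ZipFile(src) as old, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as new:
        for info in old.infolist():          # archive order, like $F
            data = old.read(info.filename)
            if info.filename == member:
                data = transform(data)
            # Keep mimetype uncompressed; deflate everything else.
            method = (zipfile.ZIP_STORED if info.filename == "mimetype"
                      else zipfile.ZIP_DEFLATED)
            new.writestr(info.filename, data, compress_type=method)
```

Something like rewrite_member("ex1.odt", "ex2.odt", "content.xml", fix) would then produce the new document, where fix is whatever function makes the change to the XML bytes. Opening the result in Writer is still the real test.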
