OpenOffice.org ODF, Python and XML

Combine Python with the open format of ODF files to manipulate fine details.
A Real-Life Example

That exercise proved the concept, so now we can get to work. My wife's poetry book was about 60 pages long, and it needed these issues addressed:

  1. Those straight quotes, which came from plain-text e-mail messages or other word processors.

  2. Apostrophes (or single quotes), which also were straight rather than curled the right way.

  3. Double hyphens and shorter dashes (the en dash), which should all be changed into the longer em dash.

OpenOffice.org Writer has keystroke sequences for creating the en dash as well as the longer em dash. Sometimes the wrong sequence was typed, so an en dash appeared instead of the desired em dash. Plain text imported from e-mail messages sometimes had double hyphens (that is, --).
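To make the third item concrete, here is a minimal Python sketch of the dash cleanup, assuming the text has already been pulled out of the document as an ordinary string. The function name normalize_dashes is made up for illustration, and this is not the fixit.py developed in this article; it only shows the substitution itself.

```python
# Illustrative only: normalize_dashes is a hypothetical helper, not the
# article's fixit.py. It works on a plain string, not on XML.
EN_DASH = "\u2013"
EM_DASH = "\u2014"


def normalize_dashes(text):
    """Replace double hyphens and en dashes with em dashes."""
    # Handle "--" first, then any en dash typed with the wrong keystrokes.
    # Runs of three or more hyphens are not treated specially here.
    return text.replace("--", EM_DASH).replace(EN_DASH, EM_DASH)


print(normalize_dashes("wait -- what?"))
```

Single hyphens are left alone, so hyphenated words survive the cleanup.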

Concretely, we want to transform what's shown in Figure 5 into what's shown in Figure 6.

Figure 5. Before...

Figure 6. ...and After

Let's develop the automated script in two pieces, and let's do it top-down. The top layer will create a temporary directory, unpack the original document and then run the bottom layer, a program (designated fixit.py) to modify content.xml. Afterward, it will pack up the files into the new document and clean up.
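As a rough picture of that flow, here is a sketch in Python using the standard zipfile module. It is an alternative illustration, not the script used in this article: it copies the archive member by member instead of unpacking into a temporary directory, and the names rewrite_odt and transform are assumptions made up for the example.

```python
import zipfile


def rewrite_odt(src, dst, transform):
    """Copy the ODF package src to dst, passing content.xml through transform.

    src and dst may be pathnames or binary file objects (they must differ);
    transform takes the bytes of content.xml and returns the new bytes.
    """
    with zipfile.ZipFile(src) as zin:
        if "content.xml" not in zin.namelist():
            raise ValueError("content.xml not found in source document")
        with zipfile.ZipFile(dst, "w") as zout:
            for info in zin.infolist():
                data = zin.read(info.filename)
                if info.filename == "content.xml":
                    data = transform(data)
                # Reusing the ZipInfo keeps the original member order and
                # each member's compression method.
                zout.writestr(info, data)
```

Calling rewrite_odt("old.odt", "new.odt", lambda data: data) with an identity transform should produce a document with the same content as the original.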

The Top Layer: a Shell Script

I want to use the highest-level language reasonable for each task; for this top layer, that's probably the shell. This script, called fixit.sh, turned out to be longer than I thought it would be, mostly because of all the error checking:

#!/bin/bash
# Script to fix up OpenDocument Text (.odt) files
# "cd" to the directory containing "fixit.py".

# Make $TMPDIR, a new temporary directory

TMPDIR=/tmp/ODFfixit.$(date +%y%m%d.%H%M%S).$$
if rm -rf $TMPDIR && mkdir $TMPDIR; then
   : # Be happy
else
   echo >&2 "Can't (re)create $TMPDIR; aborting"
   exit 1
fi

OLDFILE=$1
NEWFILE=$2

# Check number of parameters.
# Ensure $NEWFILE's dir exists and is writable.
# Quietly Unzip $OLDFILE. Whine and abort on error.

if [[ $# -eq 2 ]] &&
   touch "$NEWFILE" && rm -f "$NEWFILE" &&
                  unzip -q "$OLDFILE" -d $TMPDIR ; then
   : # All good; be happy.
else

   # Trouble! Print usage message, clean up, abort.

   echo >&2 "Usage: $0 OLDFILE NEWFILE"
   echo >&2 "  ... both OpenDocument Text (odt) files"
   echo >&2 "Note: 'OLDFILE' must already exist."
   rm -rf $TMPDIR
   exit 1
fi

# Save file list in $F; is content.xml there?

F=$(unzip -l "$OLDFILE" |
       sed -n '/:[0-9][0-9]/s|^.*:.. *||p')
if echo "$F" | grep -q '^content\.xml$'; then
   : # Good news; we have content.xml
else
   echo >&2 "content.xml not in $OLDFILE; aborting"
   echo >&2 TMPDIR is $TMPDIR
   exit 1
fi

# Now invoke the Python program to fix content.xml

mv $TMPDIR/content.xml $TMPDIR/OLDcontent.xml
if ./fixit.py $TMPDIR/OLDcontent.xml > \
                  $TMPDIR/content.xml; then
   : # It worked.
else
   echo >&2 "fixit.py failed in $TMPDIR; aborting"
   exit 1
fi

if (cd $TMPDIR; zip -q - $F) | cat > "$NEWFILE"; then
   # Everything worked! Clean up $TMPDIR
   rm -rf $TMPDIR
else # something Bad happened.
   echo >&2 "zip failed in $TMPDIR on $F"
   exit 1
fi

It's long but straightforward, so I explain only a few things here.

First, the temporary directory name includes the date and time (the date +% stuff) and the shell's process ID (the $$); together these prevent name collisions.

Second, the grep line looks the way it does because I want it to accept content.xml but not something like discontent.xml or content-xml.

Finally, we clean up the temporary directory ($TMPDIR) except in some error cases, where we leave it intact for debugging and tell the user where it is.
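The first two points can be tried out in Python. This is a sketch under stated assumptions, not part of fixit.sh: make_work_dir and has_content_xml are hypothetical helper names, and tempfile.mkdtemp picks a unique name on its own rather than building one from the date and process ID.

```python
import re
import tempfile

# Same pattern as the grep line: anchored at both ends, with the dot escaped.
CONTENT_RE = re.compile(r"^content\.xml$")


def make_work_dir(prefix="ODFfixit."):
    """Create a fresh, uniquely named temporary directory and return its path."""
    # mkdtemp both chooses an unused name and creates the directory.
    return tempfile.mkdtemp(prefix=prefix)


def has_content_xml(names):
    """Return True if exactly 'content.xml' appears in the list of names."""
    return any(CONTENT_RE.match(name) for name in names)


print(has_content_xml(["mimetype", "content.xml"]))
print(has_content_xml(["discontent.xml", "content-xml"]))
```

Without the anchors, the pattern would also match discontent.xml; without the backslash, the dot would match any character, including the hyphen in content-xml.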

We can't run this script for real yet, because we don't yet have a fixit.py that actually modifies content.xml. But we can use a stub to validate what we have so far. The fixit.sh script assumes fixit.py takes one parameter (the original content.xml's pathname) and writes the result to stdout. This happens to match the calling sequence for /bin/cat with one parameter; hence, if we use /bin/cat as our fixit.py, fixit.sh should give us a new document with the same content as the old. So, let's give it a whirl:

% ln -s /bin/cat fixit.py
% ./fixit.sh ex1.odt foo.odt
% ls -l ex1.odt foo.odt
-rw-r--r--  1 collin users 7839 2006-11-14 17:50 ex1.odt
-rw-r--r--  1 collin users 7900 2006-11-14 19:45 foo.odt
% oowriter foo.odt

The new file, foo.odt, is slightly larger than ex1.odt, but when I looked at it with OpenOffice.org Writer, it had the right stuff.
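Eyeballing the document works, but the same check can be scripted. The sketch below assumes that a size difference may stem from the zip container itself (compression settings, metadata) rather than from the content, so it compares the uncompressed bytes of content.xml in both archives. The helper names member_bytes and same_content are made up for illustration.

```python
import zipfile


def member_bytes(archive, name="content.xml"):
    """Return the uncompressed bytes of one member of a zip archive."""
    with zipfile.ZipFile(archive) as z:
        return z.read(name)


def same_content(old, new, name="content.xml"):
    """Return True if the named member is byte-for-byte equal in both archives."""
    return member_bytes(old, name) == member_bytes(new, name)
```

With the files from the run above, same_content("ex1.odt", "foo.odt") should return True if the stub left content.xml untouched.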

As for writing a program to manipulate content.xml: back in the 1990s, I probably would have spent many hours with yacc (or bison), but today Python with its XML libraries is a more natural choice.
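As a starting point, here is a minimal skeleton with the same calling convention fixit.sh expects (one pathname in, result on stdout). It is only a sketch: it assumes the standard library's xml.dom.minidom, which may not be the library the finished fixit.py uses, and the function name roundtrip is hypothetical. It parses the XML and writes it back without changing any text.

```python
#!/usr/bin/env python3
import sys
import xml.dom.minidom


def roundtrip(xml_bytes):
    """Parse an XML document and serialize it again without editing it."""
    doc = xml.dom.minidom.parseString(xml_bytes)
    try:
        # With an encoding given, toxml() returns bytes, including
        # an XML declaration.
        return doc.toxml(encoding="utf-8")
    finally:
        doc.unlink()


# Only act when called with exactly one pathname, as fixit.sh does.
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], "rb") as f:
        sys.stdout.buffer.write(roundtrip(f.read()))
```

The real work would go between parsing and serializing, where the text nodes of the document can be visited and rewritten.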
