OpenOffice.org ODF, Python and XML
That exercise proved the concept, so now we can get to work. My wife's poetry book was about 60 pages long, and it needed these issues addressed:
Those straight quotes, which came from plain-text e-mail messages or other word processors.
Apostrophes (or single quotes), which also were straight rather than curled the right way.
Double hyphens and shorter dashes (the en dash), which should all be changed into the longer em dash.
OpenOffice.org Writer has keystroke sequences for creating the en dash as well as the longer em dash. Sometimes the wrong sequence was typed, so an en dash appeared instead of the desired em dash. Plain text imported from e-mail messages sometimes had double hyphens (that is, --).
Concretely, we want to transform what's shown in Figure 5 into what's shown in Figure 6.
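The three fixes listed above amount to a handful of string substitutions. Here is a minimal sketch in Python, not the finished fixit.py: the function name fix_text and the rule "a quote opens if it starts the text or follows whitespace, an opening bracket or an em dash" are assumptions made for illustration. It works on one string at a time, so it knows nothing about a quote whose context sits in a neighboring XML node, and it will curl a leading apostrophe (as in 'tis) the wrong way.

```python
import re

LDQUO, RDQUO = "\u201c", "\u201d"   # curly double quotes
LSQUO, RSQUO = "\u2018", "\u2019"   # curly single quotes / apostrophe
EN_DASH, EM_DASH = "\u2013", "\u2014"

# Matches a position that is NOT preceded by an ordinary character, that is,
# the start of the text or just after whitespace, an opening bracket or an
# em dash. (Assumed heuristic, not taken from the article.)
OPENING = r'(?<![^\s(\[{\u2014])'

def fix_text(text):
    """Return text with dashes and straight quotes replaced (hypothetical helper)."""
    # Double hyphens and en dashes both become em dashes.
    text = text.replace("--", EM_DASH).replace(EN_DASH, EM_DASH)
    # Opening quotes first; whatever straight quotes remain must be closing
    # quotes or apostrophes.
    text = re.sub(OPENING + '"', LDQUO, text)
    text = text.replace('"', RDQUO)
    text = re.sub(OPENING + "'", LSQUO, text)
    text = text.replace("'", RSQUO)
    return text

if __name__ == "__main__":
    print(fix_text('"Don\'t go," she said -- twice.'))
```

Doing the dash replacement first matters here, because the opening-quote rule treats a preceding em dash as a signal that a quotation is starting.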
Let's develop the automated script in two pieces, and let's do it top-down. The top layer will create a temporary directory, unpack the original document and then run the bottom layer, a program (designated fixit.py) to modify content.xml. Afterward, it will pack up the files into the new document and clean up.
I want to use the highest-level language reasonable for each task; for this top layer, that's probably the shell. This script, called fixit.sh, turned out to be longer than I thought it would be, mostly because of all the error checking:
#!/bin/bash
# Script to fix up OpenDocument Text (.odt) files
# "cd" to the directory containing "fixit.py".
# Make $TMPDIR, a new temporary directory
TMPDIR=/tmp/ODFfixit.$(date +%y%m%d.%H%M%S).$$
if rm -rf "$TMPDIR" && mkdir "$TMPDIR"; then
  : # Be happy
else
  echo >&2 "Can't (re)create $TMPDIR; aborting"
  exit 1
fi
OLDFILE=$1
NEWFILE=$2
# Check number of parameters.
# Ensure $NEWFILE's dir exists and is writable.
# Quietly Unzip $OLDFILE. Whine and abort on error.
if [[ $# -eq 2 ]] &&
   touch "$NEWFILE" && rm -f "$NEWFILE" &&
   unzip -q "$OLDFILE" -d "$TMPDIR" ; then
  : # All good; be happy.
else
  # Trouble! Print usage message, clean up, abort.
  echo >&2 "Usage: $0 OLDFILE NEWFILE"
  echo >&2 " ... both OpenDocument Text (odt) files"
  echo >&2 "Note: 'OLDFILE' must already exist."
  rm -rf "$TMPDIR"
  exit 1
fi
# Save file list in $F; is content.xml there?
F=$(unzip -l "$OLDFILE" |
    sed -n '/:[0-9][0-9]/s|^.*:.. *||p')
if echo "$F" | grep -q '^content\.xml$'; then
  : # Good news; we have content.xml
else
  echo >&2 "content.xml not in $OLDFILE; aborting"
  echo >&2 "TMPDIR is $TMPDIR"
  exit 1
fi
# Now invoke the Python program to fix content.xml
mv "$TMPDIR/content.xml" "$TMPDIR/OLDcontent.xml"
if ./fixit.py "$TMPDIR/OLDcontent.xml" > \
     "$TMPDIR/content.xml"; then
  : # It worked.
else
  echo >&2 "fixit.py failed in $TMPDIR; aborting"
  exit 1
fi
# $F is deliberately unquoted so it splits into one name per word.
if (cd "$TMPDIR" && zip -q - $F) | cat > "$NEWFILE"; then
  # Everything worked! Clean up $TMPDIR
  rm -rf "$TMPDIR"
else # something Bad happened.
  echo >&2 "zip failed in $TMPDIR on $F"
  exit 1
fi
It's long but straightforward, so I explain only a few things here.
First, the temporary directory name includes the date and time (the date +% stuff) and the shell's process ID (the $$), which together prevent name collisions.
Second, the grep line looks the way it does because I want it to accept content.xml but not something like discontent.xml or content-xml.
Finally, we clean up the temporary directory ($TMPDIR) except in some error cases, where we leave it intact for debugging and tell the user where it is.
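The anchoring in that grep pattern can be checked in isolation. The sketch below uses Python's re module with the same pattern; re.MULTILINE is there so that ^ and $ match at every line of the listing, which mirrors how grep looks at one line at a time. The names ANCHORED, LOOSE and has_content_xml are made up for this illustration, and LOOSE is a deliberately sloppy pattern shown only for contrast.

```python
import re

# Same pattern as the grep line in fixit.sh: anchored at both ends, with
# the dot escaped so it matches only a literal ".".
ANCHORED = re.compile(r'^content\.xml$', re.MULTILINE)

# Hypothetical sloppier pattern: no anchors, and an unescaped dot that
# matches any character.
LOOSE = re.compile(r'content.xml')

def has_content_xml(listing):
    """True if some line of the listing is exactly 'content.xml'."""
    return ANCHORED.search(listing) is not None

if __name__ == "__main__":
    for name in ["content.xml", "discontent.xml", "content-xml"]:
        print(name, has_content_xml(name), LOOSE.search(name) is not None)
```

The anchored pattern accepts only the first name, while the loose one accepts all three: the missing ^ lets discontent.xml through, and the unescaped dot lets content-xml through.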
We can't run this script yet, because we don't yet have a fixit.py that actually modifies content.xml. But we can use a stub to validate what we have so far. The fixit.sh script assumes fixit.py will take one parameter (the original content.xml's pathname) and write the result to stdout. This happens to match the calling sequence for /bin/cat with one parameter; hence, if we use /bin/cat as our fixit.py, fixit.sh should give us a new document with the same content as the old. So, let's give it a whirl:
% ln -s /bin/cat fixit.py
% ./fixit.sh ex1.odt foo.odt
% ls -l ex1.odt foo.odt
-rw-r--r-- 1 collin users 7839 2006-11-14 17:50 ex1.odt
-rw-r--r-- 1 collin users 7900 2006-11-14 19:45 foo.odt
% oowriter foo.odt
The new file, foo.odt, is slightly larger than ex1.odt, but when I looked at it with OpenOffice.org Writer, it had the right stuff.
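Once the symlink has served its purpose, the same stub can be written in Python, which gives the real program a place to grow from. This is only a sketch of the calling convention fixit.sh expects (one pathname in, result on stdout); the function name copy_file is an assumption, and nothing here touches the XML yet.

```python
#!/usr/bin/env python3
import shutil
import sys

def copy_file(path, out):
    """Copy the file at path, byte for byte, to the binary stream out."""
    with open(path, "rb") as src:
        shutil.copyfileobj(src, out)

# fixit.sh always passes exactly one argument: the path of OLDcontent.xml.
# With any other argument count this stub simply does nothing.
if __name__ == "__main__" and len(sys.argv) == 2:
    copy_file(sys.argv[1], sys.stdout.buffer)
```

Writing to sys.stdout.buffer rather than sys.stdout keeps the copy binary, so the XML's encoding passes through untouched.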
As far as writing a program for manipulating content.xml—well, back in the 1990s, I probably would have spent many hours with yacc (or bison)—but today, Python with its XML libraries is a more natural choice.
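To give a feel for what the XML libraries buy us, here is a small sketch using xml.dom.minidom from the standard library. It is one possible approach, not necessarily the one the finished fixit.py takes. The helper name map_text_nodes and the SAMPLE fragment are invented for illustration; a real content.xml has far more structure, but the idea is the same: walk the tree, and change only the text nodes, leaving tags and attributes alone.

```python
from xml.dom import minidom

def map_text_nodes(node, fix):
    """Apply fix() to the data of every text node beneath node."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            child.data = fix(child.data)
        else:
            map_text_nodes(child, fix)

# A made-up fragment in the style of content.xml, for demonstration only.
SAMPLE = (
    '<office:text'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    '<text:p>wait -- <text:span>what</text:span> -- now</text:p>'
    '</office:text>'
)

doc = minidom.parseString(SAMPLE)
# Replace double hyphens with an em dash in text only, never in markup.
map_text_nodes(doc, lambda s: s.replace("--", "\u2014"))

if __name__ == "__main__":
    print(doc.documentElement.toxml())
```

Because the parser separates markup from text, a substitution like this cannot accidentally rewrite a double hyphen inside a tag or an attribute value, which is the main hazard of attacking content.xml with sed.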