OpenOffice.org ODF, Python and XML
My wife is a writer, which today means she uses a word processing program. It's a sophisticated, powerful program—OpenOffice.org Writer—but occasionally it won't do something that she wants it to do. In this article, we take a look at the structure of OpenDocument Format (ODF) files and see how Python, with its XML libraries, can help. Figure 1 shows an example.
It's not hard to convert quotation marks on a few paragraphs by hand—or even on a few pages, if I'm doing it only once. But having to repeat such manual operations on subsequent revisions becomes tedious, especially on a longer document, such as a poetry collection or novel. (We might have to repeat these operations after importing plain text from an e-mail message, for example.)
Fortunately, ODF is open, so we should be able to manipulate the file contents outside the word processing program.
Let's see if we can do that manually, just to make sure we know what we're doing. Once we can do that, we'll create a script to do some more ambitious things with the document.
I read somewhere that an ODF file is a zip archive of XML files. So, let's see if it really is one—and if so, what's inside:
% unzip -l ex1.odt
Archive: ex1.odt
Length Date Time Name
-------- ---- ---- ----
39 11-15-06 01:55 mimetype
0 11-15-06 01:55 Configurations2/statusbar/
0 11-15-06 01:55 Configurations2/accelerator/current.xml
0 11-15-06 01:55 Configurations2/floater/
0 11-15-06 01:55 Configurations2/popupmenu/
0 11-15-06 01:55 Configurations2/progressbar/
0 11-15-06 01:55 Configurations2/menubar/
0 11-15-06 01:55 Configurations2/toolbar/
0 11-15-06 01:55 Configurations2/images/Bitmaps/
0 11-15-06 01:55 Pictures/
2872 11-15-06 01:55 content.xml
9786 11-15-06 01:55 styles.xml
1109 11-15-06 01:55 meta.xml
878 11-15-06 01:55 Thumbnails/thumbnail.png
6611 11-15-06 01:55 settings.xml
2037 11-15-06 01:55 META-INF/manifest.xml
-------- -------
23332 16 files
%Good news—it is a zip archive.
So, the plan is this: unpack it, modify a file (or files) and pack everything back up again. We'll pack up files in the same order, just in case it matters. So, we need to save the file list.
The listing from running unzip has that file list, along with some other stuff. Let's select only the lines that have filenames (in this case, the lines with a : followed by digits) and print only the filenames. A single command to sed does that:
% unzip -l ex1.odt | sed -n '/:[0-9][0-9]/s|^.*:.. *||p' mimetype Configurations2/statusbar/ Configurations2/accelerator/current.xml Configurations2/floater/ Configurations2/popupmenu/ Configurations2/progressbar/ Configurations2/menubar/ Configurations2/toolbar/ Configurations2/images/Bitmaps/ Pictures/ content.xml styles.xml meta.xml Thumbnails/thumbnail.png settings.xml META-INF/manifest.xml %
Looks good. Let's save the list in a shell variable—we'll use F (for files):
% F=$(unzip -l ex1.odt | sed -n '/:[0-9][0-9]/s|^.*:.. *||p')
With that settled, the next question is, which file to modify? To find out, let's find the file or files containing the word quotes, which appeared in the document. We'll unpack ex1.odt into an empty directory and ask grep, remembering to check files in subdirectories as well:
% cd TMP % unzip -q ~/oo/ex1.odt % find . -type f | xargs grep -l quote ./content.xml %
Okay, content.xml is it. Text editors provide one way to manipulate content.xml, so let's give that a try. The relevant part looked like Figure 2 in Emacs.
The two occurrences of " (partially highlighted in Figure 2) represent the straight quotation marks.
I changed the straight quotes to the appropriate curly or smart quotes (found on either side of the word nice), as shown in Figure 3. The changed areas are, again, partially highlighted.
With that done, let's zip the files (the list saved in $F) to create ex2.odt, and see what OpenOffice.org Writer thinks about it:
% zip -q ~/oo/ex2.odt $F % oowriter ~/oo/ex2.odt
It worked (Figure 4)! The formerly straight quotes around the word straight are now curly quotes, and they're even curled in the right direction. So, to review what we've done so far:
Created a list of the files in ex1.odt (saving it in $F).
Unpacked ex1.odt.
Made a simple change, manually, in content.xml.
Created ex2.odt (using $F).
Validated ex2.odt using OpenOffice.org Writer.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Designing Electronics with Linux | May 22, 2013 |
| Dynamic DNS—an Object Lesson in Problem Solving | May 21, 2013 |
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
- New Products
- Linux Systems Administrator
- Senior Perl Developer
- Technical Support Rep
- UX Designer
- Web & UI Developer (JavaScript & j Query)
- Designing Electronics with Linux
- Dynamic DNS—an Object Lesson in Problem Solving
- Using Salt Stack and Vagrant for Drupal Development
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
Enter to Win an Adafruit Pi Cobbler Breakout Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Pi Cobbler Breakout Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- 5-21-13, Prototyping Pi Plate Kit: Philip Kirby
- Next winner announced on 5-27-13!
Featured Jobs
| Linux Systems Administrator | Houston and Austin, Texas | Host Gator |
| Senior Perl Developer | Austin, Texas | Host Gator |
| Technical Support Rep | Houston and Austin, Texas | Host Gator |
| UX Designer | Austin, Texas | Host Gator |
| Web & UI Developer (JavaScript & j Query) | Austin, Texas | Host Gator |
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?








4 hours 49 min ago
5 hours 5 min ago
6 hours 56 min ago
12 hours 48 min ago
17 hours 20 min ago
17 hours 20 min ago
19 hours 20 min ago
1 day 4 hours ago
1 day 4 hours ago
1 day 5 hours ago