OpenOffice.org ODF, Python and XML
My wife is a writer, which today means she uses a word processing program. It's a sophisticated, powerful program—OpenOffice.org Writer—but occasionally it won't do something that she wants it to do. In this article, we take a look at the structure of OpenDocument Format (ODF) files and see how Python, with its XML libraries, can help. Figure 1 shows an example.
It's not hard to convert quotation marks on a few paragraphs by hand—or even on a few pages, if I'm doing it only once. But having to repeat such manual operations on subsequent revisions becomes tedious, especially on a longer document, such as a poetry collection or novel. (We might have to repeat these operations after importing plain text from an e-mail message, for example.)
Fortunately, ODF is open, so we should be able to manipulate the file contents outside the word processing program.
Let's see if we can do that manually, just to make sure we know what we're doing. Once we can do that, we'll create a script to do some more ambitious things with the document.
I read somewhere that an ODF file is a zip archive of XML files. So, let's see if it really is one—and if so, what's inside:
% unzip -l ex1.odt
Archive:  ex1.odt
  Length     Date   Time    Name
 --------    ----   ----    ----
       39  11-15-06 01:55   mimetype
        0  11-15-06 01:55   Configurations2/statusbar/
        0  11-15-06 01:55   Configurations2/accelerator/current.xml
        0  11-15-06 01:55   Configurations2/floater/
        0  11-15-06 01:55   Configurations2/popupmenu/
        0  11-15-06 01:55   Configurations2/progressbar/
        0  11-15-06 01:55   Configurations2/menubar/
        0  11-15-06 01:55   Configurations2/toolbar/
        0  11-15-06 01:55   Configurations2/images/Bitmaps/
        0  11-15-06 01:55   Pictures/
     2872  11-15-06 01:55   content.xml
     9786  11-15-06 01:55   styles.xml
     1109  11-15-06 01:55   meta.xml
      878  11-15-06 01:55   Thumbnails/thumbnail.png
     6611  11-15-06 01:55   settings.xml
     2037  11-15-06 01:55   META-INF/manifest.xml
 --------                   -------
    23332                   16 files
%
Good news—it is a zip archive.
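The same check can be made from Python, which is where we are headed anyway. The following is a minimal sketch using the standard zipfile module; the function name list_members is made up for this illustration and is not part of any library.

```python
import zipfile


def list_members(path):
    """Return (name, uncompressed size) pairs in archive order.

    list_members is a hypothetical helper name used only for this sketch.
    """
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} is not a zip archive")
    with zipfile.ZipFile(path) as archive:
        # infolist() keeps the order in which members are stored.
        return [(info.filename, info.file_size) for info in archive.infolist()]
```

Calling list_members("ex1.odt") should give the same names and lengths that unzip -l printed, assuming ex1.odt is in the current directory.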
So, the plan is this: unpack it, modify a file (or files) and pack everything back up again. We'll pack up files in the same order, just in case it matters. So, we need to save the file list.
The listing from running unzip has that file list, along with some other stuff. Let's select only the lines that have filenames (in this case, the lines with a : followed by digits) and print only the filenames. A single command to sed does that:
% unzip -l ex1.odt | sed -n '/:[0-9][0-9]/s|^.*:.. *||p'
mimetype
Configurations2/statusbar/
Configurations2/accelerator/current.xml
Configurations2/floater/
Configurations2/popupmenu/
Configurations2/progressbar/
Configurations2/menubar/
Configurations2/toolbar/
Configurations2/images/Bitmaps/
Pictures/
content.xml
styles.xml
meta.xml
Thumbnails/thumbnail.png
settings.xml
META-INF/manifest.xml
%
Looks good. Let's save the list in a shell variable—we'll use F (for files):
% F=$(unzip -l ex1.odt | sed -n '/:[0-9][0-9]/s|^.*:.. *||p')
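To make clear what that sed expression does, here is a rough Python equivalent: keep only the lines containing a colon followed by two digits (the time column), then delete everything up to and including the time and the spaces after it. The name names_from_listing is hypothetical, and the sketch assumes the listing has the layout shown above.

```python
import re


def names_from_listing(listing):
    """Extract member names from the text printed by `unzip -l`.

    Mirrors sed -n '/:[0-9][0-9]/s|^.*:.. *||p' (hypothetical helper).
    """
    names = []
    for line in listing.splitlines():
        # Select only lines with a time such as 01:55 ...
        if re.search(r":[0-9][0-9]", line):
            # ... and strip everything through the time and trailing spaces.
            names.append(re.sub(r"^.*:.. *", "", line))
    return names
```

In a script that opens the archive itself, zipfile.ZipFile.namelist() returns the names in stored order directly, so no text parsing is needed.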
With that settled, the next question is, which file to modify? To find out, let's find the file or files containing the word quotes, which appeared in the document. We'll unpack ex1.odt into an empty directory and ask grep, remembering to check files in subdirectories as well:
% cd TMP
% unzip -q ~/oo/ex1.odt
% find . -type f | xargs grep -l quote
./content.xml
%
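The same search can be done without unpacking anything to disk. This is only a sketch: members_containing is a made-up name, and it does a plain byte search of each member, much as grep -l does on the unpacked files.

```python
import zipfile


def members_containing(path, word):
    """Return names of archive members whose raw bytes contain `word`.

    members_containing is a hypothetical helper for this illustration.
    """
    needle = word.encode("utf-8")
    with zipfile.ZipFile(path) as archive:
        return [
            name
            for name in archive.namelist()
            # Skip directory entries, which carry no data.
            if not name.endswith("/") and needle in archive.read(name)
        ]
```

For our example document, members_containing("ex1.odt", "quote") would be expected to name content.xml, just as grep did.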
Okay, content.xml is it. Text editors provide one way to manipulate content.xml, so let's give that a try. The relevant part looked like Figure 2 in Emacs.
The two occurrences of &quot; (partially highlighted in Figure 2) are the XML entity for the straight quotation marks.
I changed the straight quotes to the appropriate curly or "smart" quotes (the kind found on either side of the word nice), as shown in Figure 3. The changed areas are, again, partially highlighted.
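The rule I applied by hand is simple enough to write down: a straight quote at the start of the text, or right after whitespace or an opening bracket, becomes an opening curly quote, and any other straight quote becomes a closing one. Here is a minimal sketch of that rule. It is an assumption about what "appropriate" means, it works on plain text that has already been decoded from the XML (not on the raw content.xml, where quotes also delimit attributes), and curl_quotes is a made-up name.

```python
import re

OPEN_QUOTE = "\u201c"   # left double quotation mark
CLOSE_QUOTE = "\u201d"  # right double quotation mark


def curl_quotes(text):
    """Replace straight double quotes with curly ones (heuristic sketch)."""
    # A quote NOT preceded by a non-space, non-bracket character is an opener:
    # that covers start of text, whitespace, and ( [ { before the quote.
    text = re.sub(r'(?<![^\s(\[{])"', OPEN_QUOTE, text)
    # Every straight quote still left is treated as a closer.
    return text.replace('"', CLOSE_QUOTE)
```

The heuristic looks at one string at a time, so a quotation that starts in one piece of text and ends in another is still handled, but unusual punctuation around a quote may need a human eye.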
With that done, let's zip the files (the list saved in $F) to create ex2.odt, and see what OpenOffice.org Writer thinks about it:
% zip -q ~/oo/ex2.odt $F
% oowriter ~/oo/ex2.odt
It worked (Figure 4)! The formerly straight quotes around the word straight are now curly quotes, and they're even curled in the right direction. So, to review what we've done so far:
Created a list of the files in ex1.odt (saving it in $F).
Made a simple change, manually, in content.xml.
Created ex2.odt (using $F).
Validated ex2.odt using OpenOffice.org Writer.
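The first three of those steps can be sketched in a few lines of Python: read each member of the original archive in order, change one of them, and write everything to a new archive. This is a sketch under assumptions rather than a finished tool. The name rewrite_member and its transform argument (a function taking and returning bytes) are invented for this illustration. Checking the result in OpenOffice.org Writer remains a manual step.

```python
import zipfile


def rewrite_member(src_path, dst_path, member, transform):
    """Copy src_path to dst_path, passing one member through transform().

    Members are written in their original order. rewrite_member is a
    hypothetical helper; transform maps bytes to bytes.
    """
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w") as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == member:
                data = transform(data)
            # Passing the ZipInfo keeps each member's name, timestamp
            # and compression method; the size and CRC are recomputed.
            dst.writestr(info, data)
```

A call might look like rewrite_member("ex1.odt", "ex2.odt", "content.xml", edit), where edit is whatever function makes the change to the XML bytes.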