Manipulating OOo Documents with Ruby

Who says you have to wait for some future OS to integrate your office documents with business applications you develop? Work with's XML-based documents using Ruby.

And what has been zipped? Let's see:

$ unzip -l sample1.sxw
Archive:  sample1.sxw
  Length     Date   Time    Name
 --------    ----   ----    ----
       30  11-26-03 01:40   mimetype
     2328  11-26-03 01:40   content.xml
     8358  11-26-03 01:40   styles.xml
     1159  11-26-03 01:40   meta.xml
     7021  11-26-03 01:40   settings.xml
      752  11-26-03 01:40   META-INF/manifest.xml
 --------                   -------
    19648                   6 files

The OOo XML format exposes all content and metadata in plain text; there is no need to worry about cryptic binary encoding or complex layout. Because the data is exposed as XML, numerous existing XML tools are available for extra OOo parsing. Having the file in plain text means, of course, that anything you might want to know about the file is available if you simply look. However, we get a good deal of help because the team also provides assorted documentation detailing the format. The technical reference manual for XML File Format 1.0 is a 571-page PDF document. I confess to not having read the entire tome, though I doubt it lacks any detail one might care to find.

For our purposes, we need look only at some basic markup to see how OOoExtract works and to gain some understanding of the markup.

If you unzip our sample document and load content.xml into a text editor, you should notice a few things. First, the file is not formatted for your viewing pleasure. You may want to run the file though an XML-formatting tool, such as tidy, to get some new lines and indentations in place to make it easier to follow.

The file starts with an XML declaration, followed by a DOCTYPE reference. Right after that comes the root element, office:document-content. The beginning tag has a good number of XML namespace attributes. We needn't be concerned with these, but they give some idea of the range of content one might find in an OOo document.

Immediately inside the root element we find child elements for scripts, font declarations and styles. As ours is a fairly simple document, the data here is sparse. For our immediate interests, the useful stuff comes inside the office:body element. Yet, even here, a few elements simply declare the presence (or, in our case, the absence) of various items, such as tables and illustrations. The full document is available from the Linux Journal FTP site [].

The real content in our document appears inside of text:p elements:

<text:p text:style-name="Standard">My sample
<text:p text:style-name="Standard">It has two
<text:p text:style-name="Footer">This one has
some extra formatting</text:p>

Incidentally, if you are unfamiliar with some of the details of XML syntax, this notation simply says that it is a p element, defined in the text namespace. The use of the prefix and colon is a shorthand way to reference the namespace URI given at the top of the document. It's used to avoid name collisions with other p elements that may be defined for some other XML vocabulary. For our purposes we can simply think of it as one complete element name.

Our sample document had only three paragraphs, so as we might expect, there are three text:p elements. Each one has a text:style-name attribute that indicates a style to apply to the text. It is this attribute that lets OOoExtract locate text based on styles.

You may be wondering about the Footer style. Our content.xml file does not define it, and indeed this separation of style name from implementation detail is good. It would be a shame if instead of a simple name, the document had assorted attributes for font size and family, color and so on. The ability to locate content based on semantic or structural data would be lost, and we would be confined to treating the data strictly in rendering terms. If you really do want to see how OOo defined the Footer style, you can peer into styles.xml. There you'll find that Footer is based on the Standard style, with a few changes.

From Zip to REXML

It's all well and good that uses zipped XML, but once we've extracted these files, what is next? Lucky for us, Ruby 1.8 includes an outstanding XML parser, REXML. REXML is an XML 1.0 conformant parser, and in addition to its own Ruby-style API, it provides full implementations of XPath and SAX2. It was developed and is maintained by Sean Chittenden. Sean says he wrote REXML because, at the time, there were only two choices for XML parsing with Ruby. One was a binding to a native C parser, a possible limitation on portability. The other was pure Ruby, but in Sean's view, it lacked a suitable API. Sean was familiar with various Java XML parsers but disliked their adherence to the W3C's DOM or the community-driven SAX. The designers of Electric XML offered an API based on known Java idioms, one that readily would be intuitive to Java programmers.

Such was the philosophy behind the REXML API; the name stands for Ruby Electric XML. Not surprisingly, though, the REXML API moved from the Java-flavored original to a Ruby-way design, allowing developers to access and manipulate XML using the syntax and features, such as blocks and built-in iterators, common to Ruby.



Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Re: Manipulating OOo Documents with Ruby

Anonymous's picture

I must point out that there is at least one mistake in the article. Sean Russell is the author of REXML. I had hoped the online article might had been edited with the proper information, but in the meantime please note the correction.


James Britt