Manipulating OOo Documents with Ruby

Who says you have to wait for some future OS to integrate your office documents with business applications you develop? Work with's XML-based documents using Ruby.

The REXML tree parser easily lets one load XML documents:

require "rexml/document"
file = "som_xml_file.xml" )
doc = file


require "rexml/document"
my_xml_string = "<sample>
   <text>This is my REXML doc</text>
doc = my_xml_string

The Document constructor takes either a string or an I/O object; REXML figures out which it is and does the right thing. Once you have a document, you can locate elements using Ruby's Array and each syntax combined with an XPath selector:

my_xpath = "sample/text"
doc.elements.each( my_xpath ){
    |el| puts el.text }

In the above example, the each method iterates over each element matched by the XPath selector. A code block (the part inside the { ... }) is called for each iteration. The variable el is the current element in the iteration, so this example simply prints the text for each element matched by the XPath.


Our sample Writer document and its corresponding XML is quite simple, so finding what we want is close to trivial. It wouldn't take much to figure out the right element for particular content. A simple example can be best for articles such as this, but in real life we aren't likely to see anything that basic. We may know only limited details of the markup, such as the style attributes or a parent element. Finding such content becomes more of a challenge, but XPath helps save the day.

XPath is a W3C recommendation for addressing parts of an XML document. It allows one to construct a path specifier that defines location based on element and attribute names and content, plus relative or absolute positioning. Given a complex XML document, you can define an XPath expression that locates, for example, all text:p elements that are immediate children of the office:body element with this expression:


The leading asterisk says (in XPath-speak) to follow any path through the XML document tree that leads to a text:p element that is the child of an office:body element. With REXML, we can use this XPath to retrieve and iterate over a collection of matching elements:

xml.each_element( */office:body/text:p" ) do |el|
   # do something with el, such as
   # look for content or a style attribute

In this example, the code between do and end is a block. It is like an anonymous function that gets called for each item in the collection—in this case, each element matching the XPath—where the item is passed in as an argument, indicated by the two vertical bars just after “do”. This is essentially how OOoExtract works, but you should visit the OOoExtract home page for details on the numerous command-line parameters.

Toward a More General OOo API

Having seen OOoExtract, I wanted to have a more general-purpose OOo object for Ruby. The same basic ideas that drive OOoExtract could allow not only reading data, but creating, updating and deleting, for example, the CRUD operations we know and love from database tools. To this end, a project named OOo4R has been created on RubyForge, the Ruby software CVS repository. The design goals are simple access to data and metadata, transparent use of XPath and an intuitive API for doing the commonplace, such as adding paragraphs, headings and styles. Space does not allow a complete walk-though of all such features, but we can look at accessing document metadata to see one way of using Ruby's dynamic message handling to extract element content.

Earlier we saw that an OOo document has several XML files packaged in a single zip file. We looked at the content.xml file; another is meta.xml. It holds information about the document itself, such as the document title, the creation date and the word count. The root element is office:document-meta. This, in turn, contains an office:meta element that holds numerous child elements with the data of interest. For example:

<meta:initial-creator>James Britt
<dc:creator>James Britt</dc:creator>



Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Re: Manipulating OOo Documents with Ruby

Anonymous's picture

I must point out that there is at least one mistake in the article. Sean Russell is the author of REXML. I had hoped the online article might had been edited with the proper information, but in the meantime please note the correction.


James Britt