Manipulating OOo Documents with Ruby

Who says you have to wait for some future OS to integrate your office documents with business applications you develop? Work with OpenOffice.org's XML-based documents using Ruby.

The REXML API

The REXML tree parser easily lets one load XML documents:

require "rexml/document"
file = File.new( "some_xml_file.xml" )
doc = REXML::Document.new file

or:


require "rexml/document"
my_xml_string = "<sample>
   <text>This is my REXML doc</text>
   </sample>"
doc = REXML::Document.new my_xml_string

The Document constructor takes either a string or an I/O object; REXML figures out which it is and does the right thing. Once you have a document, you can locate elements using Ruby's Array and each syntax combined with an XPath selector:

my_xpath = "sample/text"
doc.elements.each( my_xpath ) { |el| puts el.text }

In the above example, the each method iterates over each element matched by the XPath selector. A code block (the part inside the { ... }) is called for each iteration. The variable el is the current element in the iteration, so this example simply prints the text for each element matched by the XPath.
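The two loading styles and the XPath iteration can be combined into one small script. This is only a sketch: the sample markup, the StringIO stand-in for a file and the helper name texts_for are assumptions made for illustration, not part of the original example.

```ruby
require "rexml/document"
require "stringio"

xml_source = "<sample><text>This is my REXML doc</text><text>A second line</text></sample>"

# The same markup, once as a String and once as an I/O object.
# StringIO stands in for an open file here.
doc_from_string = REXML::Document.new(xml_source)
doc_from_io     = REXML::Document.new(StringIO.new(xml_source))

# Hypothetical helper: collect the text of every element matched by the XPath.
def texts_for(doc, xpath)
  found = []
  doc.elements.each(xpath) { |el| found << el.text }
  found
end

string_texts = texts_for(doc_from_string, "sample/text")
io_texts     = texts_for(doc_from_io, "sample/text")

puts string_texts
```

Both documents should yield the same list of strings, since REXML treats the two kinds of source alike once parsing is done.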

XPath

Our sample Writer document and its corresponding XML is quite simple, so finding what we want is close to trivial. It wouldn't take much to figure out the right element for particular content. A simple example can be best for articles such as this, but in real life we aren't likely to see anything that basic. We may know only limited details of the markup, such as the style attributes or a parent element. Finding such content becomes more of a challenge, but XPath helps save the day.

XPath is a W3C recommendation for addressing parts of an XML document. It allows one to construct a path specifier that defines location based on element and attribute names and content, plus relative or absolute positioning. Given a complex XML document, you can define an XPath expression that locates, for example, all text:p elements that are immediate children of the office:body element with this expression:

*/office:body/text:p

The leading asterisk matches any element at the top of the path (here, the document's root element), so the expression selects every text:p element that is a child of an office:body element directly beneath it. With REXML, we can use this XPath to retrieve and iterate over a collection of matching elements:

xml.each_element( "*/office:body/text:p" ) do |el|
   # do something with el, such as
   # look for content or a style attribute
end

In this example, the code between do and end is a block. It is like an anonymous function that gets called for each item in the collection—in this case, each element matching the XPath—where the item is passed in as an argument, indicated by the two vertical bars just after “do”. This is essentially how OOoExtract works, but you should visit the OOoExtract home page for details on the numerous command-line parameters.
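To try the XPath against something concrete, here is a minimal sketch that runs it over a hand-written fragment shaped like content.xml. The fragment, its namespace URIs and the style names are assumptions for illustration; a real Writer document carries far more markup.

```ruby
require "rexml/document"

# A tiny stand-in for content.xml; the namespace URIs are assumed
# to be the ones used by OpenOffice.org 1.x documents.
content_xml = <<~XML
  <office:document-content
      xmlns:office="http://openoffice.org/2000/office"
      xmlns:text="http://openoffice.org/2000/text">
    <office:body>
      <text:p text:style-name="Standard">First paragraph</text:p>
      <text:p text:style-name="Heading">Second paragraph</text:p>
    </office:body>
  </office:document-content>
XML

doc = REXML::Document.new(content_xml)

# Gather the style attribute and text of each matching paragraph.
paragraphs = []
doc.each_element("*/office:body/text:p") do |el|
  paragraphs << [el.attributes["text:style-name"], el.text]
end

paragraphs.each { |style, text| puts "#{style}: #{text}" }
```

Collecting the matches into an array first, rather than printing inside the block, makes it easy to filter on the style attribute afterward.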

Toward a More General OOo API

Having seen OOoExtract, I wanted to have a more general-purpose OOo object for Ruby. The same basic ideas that drive OOoExtract could allow not only reading data, but also creating, updating and deleting it; that is, the CRUD operations we know and love from database tools. To this end, a project named OOo4R has been created on RubyForge, the Ruby software CVS repository. The design goals are simple access to data and metadata, transparent use of XPath and an intuitive API for doing the commonplace, such as adding paragraphs, headings and styles. Space does not allow a complete walk-through of all such features, but we can look at accessing document metadata to see one way of using Ruby's dynamic message handling to extract element content.

Earlier we saw that an OOo document has several XML files packaged in a single zip file. We looked at the content.xml file; another is meta.xml. It holds information about the document itself, such as the document title, the creation date and the word count. The root element is office:document-meta. This, in turn, contains an office:meta element that holds numerous child elements with the data of interest. For example:


<meta:initial-creator>James Britt</meta:initial-creator>
<meta:creation-date>2003-11-25T17:36:31</meta:creation-date>
<dc:creator>James Britt</dc:creator>
<dc:date>2003-11-25T18:40:59</dc:date>
<dc:language>en-US</dc:language>
<meta:editing-cycles>13</meta:editing-cycles>
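One way dynamic message handling could map onto these elements is sketched below. The MetaReader class, its find_element helper, the wrapper markup and the namespace URIs are all hypothetical, written for illustration; this is not the OOo4R API. The idea is that a call such as editing_cycles is not defined anywhere, so Ruby hands it to method_missing, which looks for a child of office:meta with the matching local name.

```ruby
require "rexml/document"

# Hypothetical wrapper (not the OOo4R API): unknown method calls become
# lookups of child elements of office:meta.
class MetaReader
  def initialize(xml)
    @doc = REXML::Document.new(xml)
  end

  def method_missing(name, *args)
    element = find_element(name)
    element ? element.text.to_s.strip : super
  end

  def respond_to_missing?(name, include_private = false)
    !find_element(name).nil? || super
  end

  private

  # Map a method name such as :editing_cycles to the local element name
  # "editing-cycles", ignoring the namespace prefix.
  def find_element(name)
    local_name = name.to_s.tr("_", "-")
    @doc.elements.to_a("*/office:meta/*").find { |el| el.name == local_name }
  end
end

# A cut-down meta.xml; the namespace URIs are assumptions.
meta_xml = <<~XML
  <office:document-meta
      xmlns:office="http://openoffice.org/2000/office"
      xmlns:meta="http://openoffice.org/2000/meta"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
    <office:meta>
      <meta:initial-creator>James Britt</meta:initial-creator>
      <dc:language>en-US</dc:language>
      <meta:editing-cycles>13</meta:editing-cycles>
    </office:meta>
  </office:document-meta>
XML

meta = MetaReader.new(meta_xml)
puts meta.initial_creator
puts meta.language
puts meta.editing_cycles
```

Matching on the local name keeps the calling code free of prefixes, at the cost of ambiguity if two namespaces ever used the same element name. Names with no matching element fall through to the usual NoMethodError.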

______________________

Comments

Re: Manipulating OOo Documents with Ruby

I must point out that there is at least one mistake in the article. Sean Russell is the author of REXML. I had hoped the online article might have been edited with the proper information, but in the meantime please note the correction.

Thanks,

James Britt
