Manipulating OOo Documents with Ruby

Software

by James Britt

on February 29, 2004

OpenOffice.org (OOo), a featureful suite of office tools that includes applications for word processing, spreadsheet creation and presentation authoring, has seen an increase in enhancements and overall quality. OOo lives up to its name by making both source code and file formats completely open. This is a big plus for anyone wishing to manipulate documents without needing to have the creator application present.

In general, two ways exist to access or manipulate document content. One is to automate the source application, letting a program substitute for a person entering commands. The other is to go directly to the document. An advantage of the first approach is you get to exploit the power of an existing application, saving yourself a good deal of time figuring out file formats and processing commands. OOo can execute internal macros and expose a scripting interface through UNO. The downside is you need to have the actual application handy, and even then it may not be able to do what you want. This article describes the second approach: accessing and manipulating documents by going directly to the source.

OOo Extract

I first became aware of what could be done with an OpenOffice.org document when Daniel Carrera announced his OOoExtract program. This is a Ruby application that allows you to run command-line searches of OOo Writer document content. As the home page states, OOoExtract performs matches on both text content and styles, executes search patterns using full regular expressions and runs searches built with Boolean operators. The program runs on any platform that has a Ruby interpreter, and they are available for pretty much every OS around.

Ruby has been discussed before in Linux Journal, but if you are not familiar with it, a good though brief description might be to say it's a cross between Perl and Smalltalk, with some features from Lisp and Python. It is deeply object-oriented and has a clean intuitive syntax. Yukihiro “Matz” Matsumoto, its creator, released the first alpha version in 1994. It has grown steadily in popularity, and the Third International Ruby Conference was held in November 2003, in Austin, Texas.

To get a feel for OOoExtract, download the program; currently, you can get the application as a single executable file or as a tarball with constituent libraries in separate files. Once installed, we can create a simple Writer document and run some searches. If you have OOo handy, fire it up and enter some brief text, such as:

My sample document
It has two lines

Save the file as sample1.sxw to the same directory where you installed OOoExtract, and run OOoExtract from the command line, like this:

./ooo_extract.rb --text sample sample1.sxw
My sample document

The program searches sample1.sxw for any lines that match on the word sample. Actually, this is a regular expression, albeit a simple one. We also can use more complex expressions, such as this one that matches any three-letter word:

./ooo_extract.rb --text "\s\w\w\w\s"  sample1.sxw
It has two lines

This is all well and good, but OOoExtract really shines by letting us search on content metadata, the extra information about the text in our document. Suppose we add an additional line to our sample Writer document:

This one has some extra formatting

After entering the text, select the word extra and apply the Footer paragraph style. Save the file and run this search:

./ooo_extract.rb --style="Footer"  sample1.sxw
This one has some extra formatting

In addition to locating text based on content, OOoExtract also can give you text with specific markup. This is quite handy if you create your own semantically rich styles. You then can use OOoExtract to retrieve information based on content and meaning, effectively turning an OpenOffice.org Writer document into a lightweight database. You can run the program against multiple files by using wild cards in the filename. For example, suppose you store recipes in Writer files. If you've defined and used custom styles, you could locate specific information, such as what recipes have apples as an ingredient:

./ooo_extract.rb --text="apple" --style="Ingredient" recipes/*.sxw
AppleSalsa.sxw: 2 medium red apples
AppleStrudel.sxw: 4 cups peeled and sliced apples

The SXW File Format

So, how does OOoExtract do its magic? The secret is in the file format. Although any given Writer file has an sxw file extension, running the UNIX file command tells us that it is a zip file:

$ file sample1.sxw
sample1.sxw: Zip archive data, at least v2.0 to extract

And what has been zipped? Let's see:

$ unzip -l sample1.sxw
Archive:  sample1.sxw
  Length     Date   Time    Name
 --------    ----   ----    ----
       30  11-26-03 01:40   mimetype
     2328  11-26-03 01:40   content.xml
     8358  11-26-03 01:40   styles.xml
     1159  11-26-03 01:40   meta.xml
     7021  11-26-03 01:40   settings.xml
      752  11-26-03 01:40   META-INF/manifest.xml
 --------                   -------
    19648                   6 files

The OOo XML format exposes all content and metadata in plain text; there is no need to worry about cryptic binary encoding or complex layout. Because the data is exposed as XML, numerous existing XML tools are available for extra OOo parsing. Having the file in plain text means, of course, that anything you might want to know about the file is available if you simply look. However, we get a good deal of help because the OpenOffice.org team also provides assorted documentation detailing the format. The technical reference manual for OpenOffice.org XML File Format 1.0 is a 571-page PDF document. I confess to not having read the entire tome, though I doubt it lacks any detail one might care to find.

For our purposes, we need look only at some basic markup to see how OOoExtract works and to gain some understanding of the markup.

If you unzip our sample document and load content.xml into a text editor, you should notice a few things. First, the file is not formatted for your viewing pleasure. You may want to run the file though an XML-formatting tool, such as tidy, to get some new lines and indentations in place to make it easier to follow.

The file starts with an XML declaration, followed by a DOCTYPE reference. Right after that comes the root element, office:document-content. The beginning tag has a good number of XML namespace attributes. We needn't be concerned with these, but they give some idea of the range of content one might find in an OOo document.

Immediately inside the root element we find child elements for scripts, font declarations and styles. As ours is a fairly simple document, the data here is sparse. For our immediate interests, the useful stuff comes inside the office:body element. Yet, even here, a few elements simply declare the presence (or, in our case, the absence) of various items, such as tables and illustrations. The full document is available from the Linux Journal FTP site [ftp.linuxjournal.com/pub/lj/listings/issue119/7236.tgz].

The real content in our document appears inside of text:p elements:


<text:p text:style-name="Standard">My sample
document</text:p>
<text:p text:style-name="Standard">It has two
lines</text:p>
<text:p text:style-name="Footer">This one has
some extra formatting</text:p>

Incidentally, if you are unfamiliar with some of the details of XML syntax, this notation simply says that it is a p element, defined in the text namespace. The use of the prefix and colon is a shorthand way to reference the namespace URI given at the top of the document. It's used to avoid name collisions with other p elements that may be defined for some other XML vocabulary. For our purposes we can simply think of it as one complete element name.

Our sample document had only three paragraphs, so as we might expect, there are three text:p elements. Each one has a text:style-name attribute that indicates a style to apply to the text. It is this attribute that lets OOoExtract locate text based on styles.

You may be wondering about the Footer style. Our content.xml file does not define it, and indeed this separation of style name from implementation detail is good. It would be a shame if instead of a simple name, the document had assorted attributes for font size and family, color and so on. The ability to locate content based on semantic or structural data would be lost, and we would be confined to treating the data strictly in rendering terms. If you really do want to see how OOo defined the Footer style, you can peer into styles.xml. There you'll find that Footer is based on the Standard style, with a few changes.

From Zip to REXML

It's all well and good that OpenOffice.org uses zipped XML, but once we've extracted these files, what is next? Lucky for us, Ruby 1.8 includes an outstanding XML parser, REXML. REXML is an XML 1.0 conformant parser, and in addition to its own Ruby-style API, it provides full implementations of XPath and SAX2. It was developed and is maintained by Sean Chittenden. Sean says he wrote REXML because, at the time, there were only two choices for XML parsing with Ruby. One was a binding to a native C parser, a possible limitation on portability. The other was pure Ruby, but in Sean's view, it lacked a suitable API. Sean was familiar with various Java XML parsers but disliked their adherence to the W3C's DOM or the community-driven SAX. The designers of Electric XML offered an API based on known Java idioms, one that readily would be intuitive to Java programmers.

Such was the philosophy behind the REXML API; the name stands for Ruby Electric XML. Not surprisingly, though, the REXML API moved from the Java-flavored original to a Ruby-way design, allowing developers to access and manipulate XML using the syntax and features, such as blocks and built-in iterators, common to Ruby.

The REXML API

The REXML tree parser easily lets one load XML documents:

require "rexml/document"
file = File.new( "som_xml_file.xml" )
doc = REXML::Document.new file

or:


require "rexml/document"
my_xml_string = "<sample>
   <text>This is my REXML doc</text>
   </sample>"
doc = REXML::Document.new my_xml_string

The Document constructor takes either a string or an I/O object; REXML figures out which it is and does the right thing. Once you have a document, you can locate elements using Ruby's Array and each syntax combined with an XPath selector:

my_xpath = "sample/text"
doc.elements.each( my_xpath ){
    |el| puts el.text }

In the above example, the each method iterates over each element matched by the XPath selector. A code block (the part inside the { ... }) is called for each iteration. The variable el is the current element in the iteration, so this example simply prints the text for each element matched by the XPath.

XPath

Our sample Writer document and its corresponding XML is quite simple, so finding what we want is close to trivial. It wouldn't take much to figure out the right element for particular content. A simple example can be best for articles such as this, but in real life we aren't likely to see anything that basic. We may know only limited details of the markup, such as the style attributes or a parent element. Finding such content becomes more of a challenge, but XPath helps save the day.

XPath is a W3C recommendation for addressing parts of an XML document. It allows one to construct a path specifier that defines location based on element and attribute names and content, plus relative or absolute positioning. Given a complex XML document, you can define an XPath expression that locates, for example, all text:p elements that are immediate children of the office:body element with this expression:

*/office:body/text:p

The leading asterisk says (in XPath-speak) to follow any path through the XML document tree that leads to a text:p element that is the child of an office:body element. With REXML, we can use this XPath to retrieve and iterate over a collection of matching elements:

xml.each_element( */office:body/text:p" ) do |el|
   # do something with el, such as
   # look for content or a style attribute
end

In this example, the code between do and end is a block. It is like an anonymous function that gets called for each item in the collection—in this case, each element matching the XPath—where the item is passed in as an argument, indicated by the two vertical bars just after “do”. This is essentially how OOoExtract works, but you should visit the OOoExtract home page for details on the numerous command-line parameters.

Toward a More General OOo API

Having seen OOoExtract, I wanted to have a more general-purpose OOo object for Ruby. The same basic ideas that drive OOoExtract could allow not only reading data, but creating, updating and deleting, for example, the CRUD operations we know and love from database tools. To this end, a project named OOo4R has been created on RubyForge, the Ruby software CVS repository. The design goals are simple access to data and metadata, transparent use of XPath and an intuitive API for doing the commonplace, such as adding paragraphs, headings and styles. Space does not allow a complete walk-though of all such features, but we can look at accessing document metadata to see one way of using Ruby's dynamic message handling to extract element content.

Earlier we saw that an OOo document has several XML files packaged in a single zip file. We looked at the content.xml file; another is meta.xml. It holds information about the document itself, such as the document title, the creation date and the word count. The root element is office:document-meta. This, in turn, contains an office:meta element that holds numerous child elements with the data of interest. For example:


<meta:initial-creator>James Britt
</meta:initial-creator>
<meta:creation-date>2003-11-25T17:36:31
</meta:creation-date>
<dc:creator>James Britt</dc:creator>
<dc:date>2003-11-25T18:40:59</dc:date>
<dc:language>en-US</dc:language>
<meta:editing-cycles>13</meta:editing-cycles>

The full metadata file is available from the Linux Journal FTP site [ftp.linuxjournal.com/pub/lj/listings/issue119/7236.tgz].

In addition to a main Document class, OOo4R defines a meta class to encapsulate the metadata. A meta class uses an REXML document to hold the contents of meta.xml. A meta object largely is a collection of attributes. Typical usage either would be asking an object for a particular value, such as the name of the author, or assigning a value, such as a new title. One way to code this would be to write a series of explicit attribute accessor methods. We would need two methods for every attribute. Or, we could use dynamic method invocation by grabbing accessor messages, finding a matching meta attribute and either performing the requested action on the corresponding attribute or raising an exception.

The following code example focuses on the Dublin Core metadata elements used in OOo. The Dublin Core Metadata Initiative is an open forum for defining metadata standards. Dublin Core elements often can be found in RSS feeds and some XHTML documents. As with all elements in an OpenOffice.org XML file, the elements have a namespace prefix. Rather than have users know and use these prefixes, we can map the full element name to something friendly.

The definition of the Meta class begins with the creation of a hash that maps friendly names to actual element names, plus a class constant to hold the base XPath for the metadata. The class constructor simply creates an REXML document from the XML source:


module OOo
  class Meta

  NAME_MAP = {
   'description' => 'dc:description',
   'subject'     => 'dc:subject',
   'creator'     => 'dc:creator',
   'author '     => 'dc:creator',
   'date'        => 'dc:date',
   'language'    => 'dc:language',
   'title'       => 'dc:title'
  }
    XPATH_BASE  = "*/office:meta"

    def initialize( src )
      @doc = REXML::Document.new(  src.to_s )
    end

We can redefine the method_missing method available to all Ruby classes so that, rather than raising an exception (as it would do by default), it looks to see if the message sent to the object maps to some item in our metadata:

def method_missing( name, *args )
  n = name.to_s
  if is_assignment? n
    el = map_for_assignment n
    xpath = "#{XPATH_BASE}/#{el}"
    assign( xpath, *args)
  else
    el = Meta.map_name n
    xpath = "#{XPATH_BASE}/#{el}"
    find( xpath  )
  end
end

The first argument to method_missing is a symbol object, so our code grabs the string representation. The is_assignment method simply checks if the name ends with an = character. If this is an assignment request, then map_for_assignment removes any trailing characters following the metadata name and maps the friendly name to the actual Dublin Core element name; assign updates the corresponding element in the REXML document:

def assign( xpath, val )
  node = @doc.elements.to_a( xpath )[0]
  node.text = val
end

If this does not appear to be an assignment, the code tries to read some metadata. As before, the name is mapped, but now the code calls find:

def find( xpath )
 begin
  return @doc.elements.to_a( xpath.to_s )[0].text
 rescue Exception
  raise OOo::OOoException.new(
     "Error with xpath '#{xpath}': #{$!}", $@ )
 end
end

# Helper methods omitted ...

 end
end

The technique works for accessing the other metadata elements, though there are special cases where the metadata is contained in a series of child elements. Updating the zip file contents and writing the zip file back to disk using Ruby's built-in Zip class, lets us save modified OOo documents.

Summary

Because the OpenOffice.org file format uses a fully documented XML format, OOo files may be created or manipulated without requiring OOo itself. Ruby's built-in XML handling and dynamic nature make it a natural fit for OOo tasks.

Resources

Dublin Core: dublincore.org

OOo_extract: www.math.umd.edu/~dcarrera/openoffice/tools/ooo_extract.html

OOo Formats: xml.coverpages.org/starOfficeXML.html

OpenOffice.org XML: xml.openoffice.org

Ruby: linux.oreillynet.com/pub/a/linux/2001/11/29/ruby.html

RubyForge: www.rubyforge.org

“Thinking XML: The open office file format”: www-106.ibm.com/developerworks/xml/library/x-think15

XPath: www.w3.org/TR/xpath

James Britt runs Neurogami, LCC, a software and design company in Scottsdale, Arizona. He has coauthored a book on XML for the Wrox Press, written various articles on software development and gave a presentation on Ruby and XML at the Third International Ruby Conference in Austin, Texas. He can be reached at jamesgb@neurogami.com.

Load Disqus comments