Manipulating OOo Documents with Ruby
And what has been zipped? Let's see:
$ unzip -l sample1.sxw Archive: sample1.sxw Length Date Time Name -------- ---- ---- ---- 30 11-26-03 01:40 mimetype 2328 11-26-03 01:40 content.xml 8358 11-26-03 01:40 styles.xml 1159 11-26-03 01:40 meta.xml 7021 11-26-03 01:40 settings.xml 752 11-26-03 01:40 META-INF/manifest.xml -------- ------- 19648 6 files
The OOo XML format exposes all content and metadata in plain text; there is no need to worry about cryptic binary encoding or complex layout. Because the data is exposed as XML, numerous existing XML tools are available for extra OOo parsing. Having the file in plain text means, of course, that anything you might want to know about the file is available if you simply look. However, we get a good deal of help because the OpenOffice.org team also provides assorted documentation detailing the format. The technical reference manual for OpenOffice.org XML File Format 1.0 is a 571-page PDF document. I confess to not having read the entire tome, though I doubt it lacks any detail one might care to find.
For our purposes, we need look only at some basic markup to see how OOoExtract works and to gain some understanding of the markup.
If you unzip our sample document and load content.xml into a text editor, you should notice a few things. First, the file is not formatted for your viewing pleasure. You may want to run the file though an XML-formatting tool, such as tidy, to get some new lines and indentations in place to make it easier to follow.
The file starts with an XML declaration, followed by a DOCTYPE reference. Right after that comes the root element, office:document-content. The beginning tag has a good number of XML namespace attributes. We needn't be concerned with these, but they give some idea of the range of content one might find in an OOo document.
Immediately inside the root element we find child elements for scripts, font declarations and styles. As ours is a fairly simple document, the data here is sparse. For our immediate interests, the useful stuff comes inside the office:body element. Yet, even here, a few elements simply declare the presence (or, in our case, the absence) of various items, such as tables and illustrations. The full document is available from the Linux Journal FTP site [ftp.linuxjournal.com/pub/lj/listings/issue119/7236.tgz].
The real content in our document appears inside of text:p elements:
<text:p text:style-name="Standard">My sample document</text:p> <text:p text:style-name="Standard">It has two lines</text:p> <text:p text:style-name="Footer">This one has some extra formatting</text:p>
Incidentally, if you are unfamiliar with some of the details of XML syntax, this notation simply says that it is a p element, defined in the text namespace. The use of the prefix and colon is a shorthand way to reference the namespace URI given at the top of the document. It's used to avoid name collisions with other p elements that may be defined for some other XML vocabulary. For our purposes we can simply think of it as one complete element name.
Our sample document had only three paragraphs, so as we might expect, there are three text:p elements. Each one has a text:style-name attribute that indicates a style to apply to the text. It is this attribute that lets OOoExtract locate text based on styles.
You may be wondering about the Footer style. Our content.xml file does not define it, and indeed this separation of style name from implementation detail is good. It would be a shame if instead of a simple name, the document had assorted attributes for font size and family, color and so on. The ability to locate content based on semantic or structural data would be lost, and we would be confined to treating the data strictly in rendering terms. If you really do want to see how OOo defined the Footer style, you can peer into styles.xml. There you'll find that Footer is based on the Standard style, with a few changes.
It's all well and good that OpenOffice.org uses zipped XML, but once we've extracted these files, what is next? Lucky for us, Ruby 1.8 includes an outstanding XML parser, REXML. REXML is an XML 1.0 conformant parser, and in addition to its own Ruby-style API, it provides full implementations of XPath and SAX2. It was developed and is maintained by Sean Chittenden. Sean says he wrote REXML because, at the time, there were only two choices for XML parsing with Ruby. One was a binding to a native C parser, a possible limitation on portability. The other was pure Ruby, but in Sean's view, it lacked a suitable API. Sean was familiar with various Java XML parsers but disliked their adherence to the W3C's DOM or the community-driven SAX. The designers of Electric XML offered an API based on known Java idioms, one that readily would be intuitive to Java programmers.
Such was the philosophy behind the REXML API; the name stands for Ruby Electric XML. Not surprisingly, though, the REXML API moved from the Java-flavored original to a Ruby-way design, allowing developers to access and manipulate XML using the syntax and features, such as blocks and built-in iterators, common to Ruby.
- Linux Kernel Testing and Debugging
- Tails above the Rest, Part III
- Wanted: Your Embedded Linux Projects
- NSA: Linux Journal is an "extremist forum" and its readers get flagged for extra surveillance
- The 101 Uses of OpenSSH: Part I
- RSS Feeds
- Docker: Lightweight Linux Containers for Consistent Development and Deployment
- Dolphins in the NSA Dragnet
- Tails above the Rest, Part II
- Are you an extremist?