Manipulating OOo Documents with Ruby
OpenOffice.org (OOo), a featureful suite of office tools that includes applications for word processing, spreadsheet creation and presentation authoring, has seen an increase in enhancements and overall quality. OOo lives up to its name by making both source code and file formats completely open. This is a big plus for anyone wishing to manipulate documents without needing to have the creator application present.
In general, two ways exist to access or manipulate document content. One is to automate the source application, letting a program substitute for a person entering commands. The other is to go directly to the document. An advantage of the first approach is you get to exploit the power of an existing application, saving yourself a good deal of time figuring out file formats and processing commands. OOo can execute internal macros and expose a scripting interface through UNO. The downside is you need to have the actual application handy, and even then it may not be able to do what you want. This article describes the second approach: accessing and manipulating documents by going directly to the source.
I first became aware of what could be done with an OpenOffice.org document when Daniel Carrera announced his OOoExtract program. This is a Ruby application that allows you to run command-line searches of OOo Writer document content. As the home page states, OOoExtract performs matches on both text content and styles, executes search patterns using full regular expressions and runs searches built with Boolean operators. The program runs on any platform that has a Ruby interpreter, and they are available for pretty much every OS around.
Ruby has been discussed before in Linux Journal, but if you are not familiar with it, a good though brief description might be to say it's a cross between Perl and Smalltalk, with some features from Lisp and Python. It is deeply object-oriented and has a clean intuitive syntax. Yukihiro “Matz” Matsumoto, its creator, released the first alpha version in 1994. It has grown steadily in popularity, and the Third International Ruby Conference was held in November 2003, in Austin, Texas.
To get a feel for OOoExtract, download the program; currently, you can get the application as a single executable file or as a tarball with constituent libraries in separate files. Once installed, we can create a simple Writer document and run some searches. If you have OOo handy, fire it up and enter some brief text, such as:
My sample document It has two lines
Save the file as sample1.sxw to the same directory where you installed OOoExtract, and run OOoExtract from the command line, like this:
./ooo_extract.rb --text sample sample1.sxw My sample document
The program searches sample1.sxw for any lines that match on the word sample. Actually, this is a regular expression, albeit a simple one. We also can use more complex expressions, such as this one that matches any three-letter word:
./ooo_extract.rb --text "\s\w\w\w\s" sample1.sxw It has two lines
This is all well and good, but OOoExtract really shines by letting us search on content metadata, the extra information about the text in our document. Suppose we add an additional line to our sample Writer document:
This one has some extra formatting
After entering the text, select the word extra and apply the Footer paragraph style. Save the file and run this search:
./ooo_extract.rb --style="Footer" sample1.sxw This one has some extra formatting
In addition to locating text based on content, OOoExtract also can give you text with specific markup. This is quite handy if you create your own semantically rich styles. You then can use OOoExtract to retrieve information based on content and meaning, effectively turning an OpenOffice.org Writer document into a lightweight database. You can run the program against multiple files by using wild cards in the filename. For example, suppose you store recipes in Writer files. If you've defined and used custom styles, you could locate specific information, such as what recipes have apples as an ingredient:
./ooo_extract.rb --text="apple" --style="Ingredient" recipes/*.sxw AppleSalsa.sxw: 2 medium red apples AppleStrudel.sxw: 4 cups peeled and sliced apples
Fast/Flexible Linux OS Recovery
On Demand Now
In this live one-hour webinar, learn how to enhance your existing backup strategies for complete disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible full-system recovery solution for UNIX and Linux systems.
Join Linux Journal's Shawn Powers and David Huffman, President/CEO, Storix, Inc.
Free to Linux Journal readers.Register Now!
- Download "Linux Management with Red Hat Satellite: Measuring Business Impact and ROI"
- Profiles and RC Files
- Understanding Ceph and Its Place in the Market
- Astronomy for KDE
- The Giant Zero, Part 0.x
- OpenSwitch Finds a New Home
- Maru OS Brings Debian to Your Phone
- Git 2.9 Released
- What's Our Next Fight?
- SoftMaker FreeOffice
With all the industry talk about the benefits of Linux on Power and all the performance advantages offered by its open architecture, you may be considering a move in that direction. If you are thinking about analytics, big data and cloud computing, you would be right to evaluate Power. The idea of using commodity x86 hardware and replacing it every three years is an outdated cost model. It doesn’t consider the total cost of ownership, and it doesn’t consider the advantage of real processing power, high-availability and multithreading like a demon.
This ebook takes a look at some of the practical applications of the Linux on Power platform and ways you might bring all the performance power of this open architecture to bear for your organization. There are no smoke and mirrors here—just hard, cold, empirical evidence provided by independent sources. I also consider some innovative ways Linux on Power will be used in the future.Get the Guide