Manipulating OOo Documents with Ruby
OpenOffice.org (OOo), a featureful suite of office tools that includes applications for word processing, spreadsheet creation and presentation authoring, has seen an increase in enhancements and overall quality. OOo lives up to its name by making both source code and file formats completely open. This is a big plus for anyone wishing to manipulate documents without needing to have the creator application present.
In general, two ways exist to access or manipulate document content. One is to automate the source application, letting a program substitute for a person entering commands. The other is to go directly to the document. An advantage of the first approach is you get to exploit the power of an existing application, saving yourself a good deal of time figuring out file formats and processing commands. OOo can execute internal macros and expose a scripting interface through UNO. The downside is you need to have the actual application handy, and even then it may not be able to do what you want. This article describes the second approach: accessing and manipulating documents by going directly to the source.
I first became aware of what could be done with an OpenOffice.org document when Daniel Carrera announced his OOoExtract program. This is a Ruby application that allows you to run command-line searches of OOo Writer document content. As the home page states, OOoExtract performs matches on both text content and styles, executes search patterns using full regular expressions and runs searches built with Boolean operators. The program runs on any platform that has a Ruby interpreter, and they are available for pretty much every OS around.
Ruby has been discussed before in Linux Journal, but if you are not familiar with it, a good though brief description might be to say it's a cross between Perl and Smalltalk, with some features from Lisp and Python. It is deeply object-oriented and has a clean intuitive syntax. Yukihiro “Matz” Matsumoto, its creator, released the first alpha version in 1994. It has grown steadily in popularity, and the Third International Ruby Conference was held in November 2003, in Austin, Texas.
To get a feel for OOoExtract, download the program; currently, you can get the application as a single executable file or as a tarball with constituent libraries in separate files. Once installed, we can create a simple Writer document and run some searches. If you have OOo handy, fire it up and enter some brief text, such as:
My sample document It has two lines
Save the file as sample1.sxw to the same directory where you installed OOoExtract, and run OOoExtract from the command line, like this:
./ooo_extract.rb --text sample sample1.sxw My sample document
The program searches sample1.sxw for any lines that match on the word sample. Actually, this is a regular expression, albeit a simple one. We also can use more complex expressions, such as this one that matches any three-letter word:
./ooo_extract.rb --text "\s\w\w\w\s" sample1.sxw It has two lines
This is all well and good, but OOoExtract really shines by letting us search on content metadata, the extra information about the text in our document. Suppose we add an additional line to our sample Writer document:
This one has some extra formatting
After entering the text, select the word extra and apply the Footer paragraph style. Save the file and run this search:
./ooo_extract.rb --style="Footer" sample1.sxw This one has some extra formatting
In addition to locating text based on content, OOoExtract also can give you text with specific markup. This is quite handy if you create your own semantically rich styles. You then can use OOoExtract to retrieve information based on content and meaning, effectively turning an OpenOffice.org Writer document into a lightweight database. You can run the program against multiple files by using wild cards in the filename. For example, suppose you store recipes in Writer files. If you've defined and used custom styles, you could locate specific information, such as what recipes have apples as an ingredient:
./ooo_extract.rb --text="apple" --style="Ingredient" recipes/*.sxw AppleSalsa.sxw: 2 medium red apples AppleStrudel.sxw: 4 cups peeled and sliced apples
Practical Task Scheduling Deployment
July 20, 2016 12:00 pm CDT
One of the best things about the UNIX environment (aside from being stable and efficient) is the vast array of software tools available to help you do your job. Traditionally, a UNIX tool does only one thing, but does that one thing very well. For example, grep is very easy to use and can search vast amounts of data quickly. The find tool can find a particular file or files based on all kinds of criteria. It's pretty easy to string these tools together to build even more powerful tools, such as a tool that finds all of the .log files in the /home directory and searches each one for a particular entry. This erector-set mentality allows UNIX system administrators to seem to always have the right tool for the job.
Cron traditionally has been considered another such a tool for job scheduling, but is it enough? This webinar considers that very question. The first part builds on a previous Geek Guide, Beyond Cron, and briefly describes how to know when it might be time to consider upgrading your job scheduling infrastructure. The second part presents an actual planning and implementation framework.
Join Linux Journal's Mike Diehl and Pat Cameron of Help Systems.
Free to Linux Journal readers.Register Now!
- Stunnel Security for Oracle
- Murat Yener and Onur Dundar's Expert Android Studio (Wrox)
- SourceClear Open
- SUSE LLC's SUSE Manager
- My +1 Sword of Productivity
- Managing Linux Using Puppet
- Google's SwiftShader Released
- Parsing an RSS News Feed with a Bash Script
- Non-Linux FOSS: Caffeine!
- SuperTuxKart 0.9.2 Released
With all the industry talk about the benefits of Linux on Power and all the performance advantages offered by its open architecture, you may be considering a move in that direction. If you are thinking about analytics, big data and cloud computing, you would be right to evaluate Power. The idea of using commodity x86 hardware and replacing it every three years is an outdated cost model. It doesn’t consider the total cost of ownership, and it doesn’t consider the advantage of real processing power, high-availability and multithreading like a demon.
This ebook takes a look at some of the practical applications of the Linux on Power platform and ways you might bring all the performance power of this open architecture to bear for your organization. There are no smoke and mirrors here—just hard, cold, empirical evidence provided by independent sources. I also consider some innovative ways Linux on Power will be used in the future.Get the Guide