The OASIS Standard for Office Documents: How All Users and Developers Can Benefit

A common set of file formats has the potential to be the most meaningful advancement for free software on the desktop.
Parsers and Libraries

At a lower level, what is needed to manage OASIS files in a larger application, where the source language usually is C or C++ and the performance must be maximized? First of all, the program must include the proper library to compress and uncompress zipped files. This is not an OASIS-specific issue, so we won't deal with it further.

Once the single XML files are available, they have to be loaded in a way that understands and makes accessible the internal structure, that is, the relationships among the several elements. Once this step has been performed, data can be converted or processed in any manner. A lot of tools for this already exist. Several of them are designed to support general XML rather than OASIS, but the difference is quite a bit smaller than one might expect. And this situation is expected to improve soon after the standard is released.

Expat is a popular XML parser written in C that is basic and lacks a validation capability but still is the fastest one around. It also has front ends for practically every language. A more featureful library that supports DTD validation and is designed specifically for GNOME is Libxml. Like Expat, Libxml is written in C, is portable and can be used within a lot of languages. The Xerces parser, in Java, also can generate and validate XML documents.

In the Qt/KDE field, developers have at their disposal, besides the OOo plugin already mentioned, the related Qt classes and DOM implementation (QDom) to write or parse XML, as well as the KOffice DTD. At the time of this writing, these tools still target the KOffice XML format, but they are expected to converge on the OASIS standard.

For security-conscious developers, the easiest starting point is the C XML security library (XMLsec), based on LibXML2, which supports both signing and encryption of XML material. SAXEcho is a (mostly) Java program that attaches itself to a running OpenOffice.org document to show the XML tree representation of the current document. It also validates or modifies the document operating directly on XML nodes, plus several other nifty things.

Event-Driven XML Processing

The parsers described above build an internal tree representation of the document. What should one do when developing applications that must deal with large documents? Keep in mind that large here means too big to fit into memory, which is not so big if this format must be usable even for low-end desktop applications.

The current solutions in this space follow the so-called SAX (simple API for XML) approach: instead of building the whole tree of a document in one fell swoop and keeping it there for further processing, go step by step. A SAX parser reads the document and, instead of keeping it all in memory, generates an event every time it finds something worthwhile. The parser then passes the event to event handlers that interact with the application. The something worthwhile can be XML document-type definitions, errors or elements of the actual content. A good starting point for SAX-based programming is the SAX Project. SAX2 already is supported in Java through JAXP and in Perl through the Orchard Project, which is quite stable, not to mention fast and lightweight, as far as SAX and XML processing are concerned.

Conclusion

All the research done for this article confirmed one of my first impressions: so far, the free software/open-source software approach to guarantee information interchange has been to develop cross-platform applications, which are difficult to maintain and optimize for each target environment. Now it looks like we are starting to do the right thing, which is to define truly Free, standard, toolkit-independent, cross-platform formats that leave everyone free to create any possible front end to read and write them.

Acknowledgements

Thanks above all to Gary Edwards and David Faure for all the material and explanations. Pierre Souchay (kfile-plugin-ooo) and the AbiWord developers also were very helpful.

Marco Fioretti is a hardware systems engineer interested in free software both as an EDA platform and, as the current leader of the RULE Project, as an efficient desktop. Marco lives with his family in Rome, Italy.

______________________

Articles about Digital Rights and more at http://stop.zona-m.net CV, talks and bio at http://mfioretti.com

Comments

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

just a dream

Spencer's picture

I hope some day it won't matter if I work on my text with KWord, AbiWord, OpenOffice Writer or ...... Micrsoft Word

Webinar
One Click, Universal Protection: Implementing Centralized Security Policies on Linux Systems

As Linux continues to play an ever increasing role in corporate data centers and institutions, ensuring the integrity and protection of these systems must be a priority. With 60% of the world's websites and an increasing share of organization's mission-critical workloads running on Linux, failing to stop malware and other advanced threats on Linux can increasingly impact an organization's reputation and bottom line.

Learn More

Sponsored by Bit9

Webinar
Linux Backup and Recovery Webinar

Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.

In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.

Learn More

Sponsored by Storix