The OASIS Standard for Office Documents: How All Users and Developers Can Benefit
Desktop integration begins with documents, not with any toolkit or bundle of applications. If files can be read and written by every application, users can communicate, work together and become integrated. In this sense, the OASIS XML format for office documents has the potential to be one of the most meaningful advances in free computing.
OASIS stands for Organization for the Advancement of Structured Information Standards. Formerly SGML Open, this nonprofit consortium, which includes such companies as IBM, Sun and Boeing, aims to create open standards for almost any kind of structured information. The one we cover here is an XML-based format common to all kinds of office files—text, spreadsheets, presentations and more.
The significance of an effort of this caliber to promote a file format, rather than any specific desktop, application or the Linux kernel itself, cannot be overestimated. Free as in free formats is even more important than free software. Only with free formats, and the internal structuring that comes from XML, can data be exchanged with new or different programs without any need for converters, or be directly edited, indexed, analyzed and exchanged between heterogeneous groups or servers, like Web services without the hype. Data will start belonging exclusively to end users.
The OASIS Office Technical Committee had its first meeting in February 2003. The official file format should be voted on in February 2004. After the approval, Phase 2 will start; its main goal will be to extend the base specification to additional areas of application. The real goal is the move to a document-centric model, independent from and available to any given program, regardless of its license. The Technical Committee is determined to abandon the assumption that every file spec must be bound to one application, as is the case today.
Some farsighted public administrations already have started to think in this way. The Swedish Agency for Public Management says, “[We] should also follow and if possible support work that takes place in OASIS....An open file format for office software is of great importance for increased interoperability” (www.openoffice.org/servlets/ReadMsg?msgId=585772&listName=discuss). At the European Union level, IDA (Interchange of Data between Administrations) decided in 2003 to carry out exploratory work on open document formats and on how public administrations could persuade software vendors to support them.
The standard conforms to general W3C specifications for XML technologies and covers every aspect of document usage. User interaction, for example, is described in XML schema templates, which operate like traditional API functions. Even they, however, now are independent of any single application.
A text format can be much bigger and less efficient than an equally free but binary one. Even when the performance hit would be noticeable, however, the benefits simply are too great to give up. In itself, an OASIS office file (be it text, presentation or spreadsheet) is a zip archive: the compression format chosen is a compromise of efficiency, speed of accessing internal parts and algorithm license. Unzipping it, we first find five XML files: styles.xml, presentation and formatting; content.xml, actual contents; settings.xml, application settings such as zoom level and printer; meta.xml, language and encoding metadata; and manifest.xml, an explanation of what all the other files are and their relative paths.
Other components (each in a predefined folder, so that even virus scanners have an easier time) may be macros, their dialogs and objects, such as charts or formulas.
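As a rough illustration of that layout, the following sketch checks an already unpacked document (for example, the directory produced by running unzip on the file) for the five XML files listed above. The helper name check_parts is invented for this example, and the content file is assumed to be called content.xml. Because the manifest may be stored in a subfolder, the sketch also looks under META-INF/, which is an assumption about the layout rather than something spelled out here.

```shell
#!/bin/sh
# check_parts DIR: report which of the expected XML files are missing
# from an unpacked document directory. Hypothetical helper, for illustration.
check_parts() {
    dir=$1
    missing=0
    for part in styles.xml content.xml settings.xml meta.xml; do
        if [ ! -f "$dir/$part" ]; then
            echo "missing: $part"
            missing=1
        fi
    done
    # Assumption: the manifest sits at the top level or under META-INF/.
    if [ ! -f "$dir/manifest.xml" ] && [ ! -f "$dir/META-INF/manifest.xml" ]; then
        echo "missing: manifest.xml"
        missing=1
    fi
    # Exit status 0 means every expected file was found.
    return "$missing"
}

# Demo on a throwaway directory that deliberately lacks settings.xml.
demo=$(mktemp -d)
touch "$demo/styles.xml" "$demo/content.xml" "$demo/meta.xml" "$demo/manifest.xml"
check_parts "$demo" || echo "document is incomplete"
rm -rf "$demo"
```

Because the function reports through its exit status as well as its output, it can be dropped into a larger script that sorts complete documents from damaged ones.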
Because the standard requires that all pieces be present in the zip archive, no information is lost: content, layout and everything else always travel together. Unlike some proprietary offerings in the same space, there is no restriction on which application must be employed to make full use of a document. WYSIWYG results are possible and can be specified fully or replaced in the styles.xml file. At the same time, however, content and presentation are decoupled; hence, the content and nothing else is accessible to any application, for any conceivable use. kfile-plugin-ooo, for example, extracts all the metadata embedded in the new file format. The end user then can read, search by metadata or modify all this information straight from KOffice or Konqueror. This plugin also is included in the latest KOffice source trees.
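Even without a dedicated plugin, plain text tools can pull individual metadata fields out of an unpacked meta.xml. The sketch below is only a text-processing approximation and says nothing about how kfile-plugin-ooo itself works. The function name meta_field and the sample data are invented, and the element names dc:title and dc:creator are assumed here to be the Dublin Core names the file uses.

```shell
#!/bin/sh
# meta_field FILE TAG: print the text of the first <TAG>...</TAG> element.
# Hypothetical helper; it does not decode XML entities or handle nesting.
meta_field() {
    # Split at every "<", keep the line that opens TAG, take what follows ">".
    tr "<" "\012" < "$1" | grep "^$2[ >]" | cut -d'>' -f2 | head -n 1
}

# A tiny stand-in for a real meta.xml (invented sample data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<office:meta>
<dc:title>Open formats</dc:title>
<dc:creator>A. Writer</dc:creator>
</office:meta>
EOF

meta_field "$sample" dc:title
meta_field "$sample" dc:creator
rm -f "$sample"
```

A real XML parser is the safer choice for anything beyond quick inspection, because this approach depends on each value directly following its opening tag.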
Text format and internal structure make decades of UNIX experience in processing and generating text come back with a vengeance to tame complex, WYSIWYG office documents of every kind. Shell one-liners, Web spiders and so on can query and process single documents, or whole classes of them, directly, much like a database engine. Viewing attached presentations as text in mutt or industry-level content management systems becomes easier. As a proof of concept, I was able to get the (admittedly rough) outline of Listing 1 from a presentation simply by typing:
# tr "<" "\012" < content.xml | grep ^text | cut '-d>' -f2 | uniq
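To make the stages of such a pipeline easier to follow, here is a minimal sketch that wraps the same idea in a shell function and runs it on a small, hand-written fragment. The function name outline and the sample file are made up for illustration. Real content.xml files are far larger and nest their elements more deeply, so the result there will be rougher.

```shell
#!/bin/sh
# outline FILE: print the text that directly follows each <text:...> tag.
# "outline" is a name chosen for this sketch, not a standard tool.
outline() {
    # 1. tr:   break the XML at every "<" so each tag starts its own line
    # 2. grep: keep only lines that begin with a tag in the text: namespace
    # 3. cut:  drop the tag itself and keep what follows the first ">"
    # 4. uniq: collapse adjacent duplicate lines
    tr "<" "\012" < "$1" | grep '^text' | cut -d'>' -f2 | uniq
}

# A tiny stand-in for a presentation's content.xml (invented sample data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<text:p text:style-name="P1">Open formats</text:p>
<text:p text:style-name="P2">Why they matter</text:p>
<text:p text:style-name="P2">Why they matter</text:p>
EOF

outline "$sample"
rm -f "$sample"
```

Closing tags begin with "/" after the split, so the grep stage discards them, and the repeated paragraph is printed only once.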
Articles about Digital Rights and more at http://stop.zona-m.net; CV, talks and bio at http://mfioretti.com