OpenOffice.org in the Limelight

How CHIP Special Linux uses OpenOffice.org Writer as an editorial tool in a multiplatform publishing house.

OpenOffice.org is a capable office suite made up of several components, each with a wide range of options. It is customizable and uses open document formats. To adapt the basic configuration to your particular needs, OpenOffice.org lets you write macros and additional scripts.

I work as an editor at a Polish free software magazine. At the beginning of the editorial process, the author supplies the text and the editor edits it. Editing means removing common factual and formal errors, as well as bringing the text into a standard form that is easier to process at later stages. The proofreader then corrects the text, and the editor looks through it again and makes the final changes. Finally, the typesetter prepares the text for printing, and the editor checks the entire work one last time.

The processed text is in a different format at each stage of this process. Our publishing house prefers open formats for documents, so our authors deliver the documents in text or HTML formats and the graphics in PNG or EPS formats. After editing the document, the editor sends a copy to the author in HTML. Our proofreaders work on Microsoft Windows systems and use Microsoft Word, so they need the documents in the .doc file format. Our typesetters work on Macintosh systems and use QuarkXPress. They need two kinds of documents: Microsoft Word files, for printing and checking the required formatting of the article, and Macintosh text files, for opening in Quark and processing.

When our quarterly started in autumn 2000, I was using StarOffice. Since then, I have switched to OpenOffice.org. The methods for working with authors' text files are similar in StarOffice and OpenOffice.org. I import the document in text or HTML format using StarWriter (previously) or OpenOffice.org Writer (at present) and, after processing it, export it to HTML, Microsoft Word or the corresponding SDW or SXW file formats.

Figure 1. The KillparZ macro facilitates preprocessing of the imported text files.

Importing Text and HTML Files

If a source file is prepared well, there should be no problems when importing it. If a file is damaged, it must be repaired. This is not difficult, because the document formats are open and can be inspected and corrected by hand.

Once a file is imported, you need to change it to the proper format. The editors of Polish, German, French or other non-English language publications should change the codepage as well. A standard codepage for Polish documents, for example, is ISO-8859-2, and the standard codepage for all OpenOffice.org documents is UTF-8. To convert imported documents conveniently, you need a macro. The macros I've built for OpenOffice.org include several codepage converters, among them converters from ISO-8859-2 to UTF-8 and vice versa.
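The conversion itself is simple to illustrate outside OpenOffice.org. The following Python sketch is not the author's macro; it only shows the underlying idea of re-encoding bytes from ISO-8859-2 to UTF-8 and back. The function names `convert_codepage` and `convert_file` are hypothetical and chosen for this example.

```python
def convert_codepage(data: bytes,
                     source: str = "iso-8859-2",
                     target: str = "utf-8") -> bytes:
    """Re-encode raw bytes from one codepage to another.

    Hypothetical helper for illustration. Decoding raises
    UnicodeDecodeError if the input is not valid in `source`,
    and encoding raises UnicodeEncodeError if a character
    cannot be represented in `target`.
    """
    return data.decode(source).encode(target)


def convert_file(src_path: str, dst_path: str,
                 source: str = "iso-8859-2",
                 target: str = "utf-8") -> None:
    """Read a file in the source codepage and write it in the target one."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(convert_codepage(data, source, target))
```

Swapping the `source` and `target` arguments gives the reverse conversion, which only succeeds if every character in the text exists in ISO-8859-2.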

Paragraphs in text files written in some text editors may be broken into a number of lines. To consolidate them, you need to use the KillparZ macro, which is an improved version of the killpars macro by Andrew Brown (Figure 1). KillparZ is a component of the ooo-macro bundle.
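To show what consolidating broken paragraphs involves, here is a minimal Python sketch. It is not the KillparZ macro and does not claim to reproduce its rules; it assumes that paragraphs are separated by at least one blank line and that every other line break is an unwanted hard wrap. The function name `join_wrapped_lines` is hypothetical.

```python
import re


def join_wrapped_lines(text: str) -> str:
    """Merge hard-wrapped lines into one line per paragraph.

    Assumption for this sketch: one or more blank lines mark a
    paragraph boundary; all other line breaks are removed.
    """
    # Normalize Windows (CRLF) and old Macintosh (CR) line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split on blank lines, which are taken as paragraph separators.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse all whitespace inside each paragraph to single spaces.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

A text whose author indents paragraphs instead of separating them with blank lines would need a different boundary rule, which is one reason a dedicated macro is useful in practice.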

Assuming the author of the document declared the appropriate charset, there shouldn't be a problem with the codepage when you import an HTML file. But another problem may arise: the keyboard shortcuts associated with your macros stop working in HTML documents. To make the macros work, create an empty OpenOffice.org Writer document, open the HTML file, copy its content, close the HTML file and, finally, paste the content into the Writer document.

Codepages and DOCs

Our magazine is published in Polish, so I need to use more sophisticated methods when exporting files. Specifically, I need to use fonts with Polish diacritics. My tests of StarWriter and OpenOffice.org Writer have shown that if you want to avoid codepage problems in non-English language documents, you should use TrueType fonts instead of Type1 fonts. Moreover, you get the best results when exporting documents to the Microsoft Word format if you use the same fonts that Microsoft Windows uses. The Microsoft fonts bundled in Microsoft FontPack, including Times New Roman, Arial and Courier New, are sufficient in most cases.

The authors of StarOffice and OpenOffice.org had to use some reverse engineering to discover how the Microsoft Word format is constructed. As a result, the export filter from Writer to Word works well but not perfectly. Therefore, if you want to exchange standard document types with other users, prepare one typical document using all the necessary formatting, including headers, italics and boldface. Then make the sample available to your coworkers and ask them whether everything displays correctly.

The articles we publish are simple documents. Our editorial office uses the three fonts mentioned above, as well as italic and bold, two levels of headers and simple tables. We do not include the graphics in our documents; we simply list the names of the files in PNG or EPS format. Such documents can be exported from SDW or SXW formats to Microsoft Word without any problems.

Figure 2. An HTML file as exported by OpenOffice.org—it uses styles, classes and a lot of other unwanted formatting.

Figure 3. The same HTML file converted using the soffice2html filter—more standardized and more readable.

Figure 4. CHIP Special editorial staff, from left to right: Robert Bielecki (editor), Romek Gnitecki (editor in chief), Cezary M. Kruk (CHIP Special Linux) and Tomek Borukalo (editor).

______________________

Comments


Are the Perl scripts available?


Interesting reading - I was particularly interested in the way that 'clean' HTML can be obtained from the raw HTML output of Ooo writer. Are the Perl scripts available?

TIA
H.


It looks like these perl scripts manipulate tags using regular expressions. There are too many edge cases for this to be a good idea. Much better would be to properly parse the HTML, tidy it and walk the DOM tree to remove unwanted elements. The HTML section at CPAN looks like a good place to start doing that. Mind you, there are also interfaces to manipulate OOo documents directly. Good luck, happy hacking!
