OpenOffice.org in the Limelight

How CHIP Special Linux uses OpenOffice.org Writer as an editorial tool in a multiplatform publishing house.

OpenOffice.org is a great set of software, consisting of several useful components that offer a lot of options. It is customizable and introduces many open formats for documents. To adapt the basic configuration to your particular needs, OpenOffice.org allows you to prepare macros and additional scripts.

I work as an editor at a Polish free software magazine. At the beginning of the editorial process, the author supplies the text and the editor edits it. Editing means removing common content-related and formal mistakes or errors, as well as preparing the text in a standard form to make it easier to process at further stages. The proofreader then corrects the text and the editor looks through it again and makes the final changes. Finally, the typesetter prepares the text for printing, and the editor checks the entire work one last time.

The processed text is in a different format at each stage of this process. Our publishing house prefers open formats for documents, so our authors deliver the documents in text or HTML formats and the graphics in PNG or EPS formats. After editing the document, the editor sends a copy to the author; that copy is in HTML. Our proofreaders work on Microsoft Windows systems and use Microsoft Word, so they need the documents to be in the .doc file format. Our typesetters work on Macintosh systems and use QuarkXPress. They need two kinds of documents: Microsoft Word files, for printing and checking the required formats for the article, and Macintosh text files, for opening in Quark and processing.

When our quarterly started in autumn 2000, I was using StarOffice. Since then, I have switched to OpenOffice.org. The methods for working with authors' text files are similar in StarOffice and OpenOffice.org. I import the document in text or HTML format using StarWriter (previously) or Writer (at present) and, after processing it, I export it to HTML, Microsoft Word or the corresponding SDW or SXW file formats.

Figure 1. The KillparZ macro facilitates preprocessing of the imported text files.

Importing Text and HTML Files

If a source file is prepared well, there should be no problems when importing it. If a file is damaged, it must be repaired. This is not difficult to do, because the documents are in open formats.

Once a file is imported, you need to change it to the proper format. The editors of Polish, German, French or other non-English language publications should change the codepage as well. A standard codepage for Polish documents, for example, is ISO-8859-2, and the standard codepage for all OpenOffice.org documents is UTF-8. To convert imported documents in a convenient way, you need a macro. The macros I've built for OpenOffice.org include several codepage converters, among them converters from ISO-8859-2 to UTF-8 and vice versa.
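For readers who want to see what such a conversion involves outside the office suite, here is a minimal Python sketch. It is not one of the macros described above; the function name convert_codepage and its defaults are assumptions made for illustration. It decodes the bytes with the source codepage and re-encodes them with the target one.

```python
def convert_codepage(data: bytes, source: str = "iso-8859-2",
                     target: str = "utf-8") -> bytes:
    """Re-encode raw bytes from one codepage to another.

    Hypothetical helper for illustration; not part of any macro bundle.
    """
    # Interpret the input bytes using the source codepage.
    text = data.decode(source, errors="strict")
    # Strict mode raises an error instead of silently dropping characters
    # that the target codepage cannot represent.
    return text.encode(target, errors="strict")


if __name__ == "__main__":
    latin2 = "Zażółć gęślą jaźń".encode("iso-8859-2")
    utf8 = convert_codepage(latin2)
    # Accented Polish letters take one byte in ISO-8859-2, two in UTF-8.
    print(len(latin2), len(utf8))
```

Calling convert_codepage(utf8, "utf-8", "iso-8859-2") performs the reverse conversion. It fails with a UnicodeEncodeError if the text contains a character that ISO-8859-2 lacks, which is usually preferable to losing the character without notice.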

Paragraphs in text files written in some text editors may be broken into a number of lines. To consolidate them, you need to use the KillparZ macro, which is an improved version of the killpars macro by Andrew Brown (Figure 1). KillparZ is a component of the ooo-macro bundle.
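The idea of consolidating broken paragraphs can be sketched in a few lines of Python. This is not the KillparZ macro itself, whose exact rules are not reproduced here. The sketch assumes that a blank line separates paragraphs and that every other line break is an unwanted hard wrap; the function name join_wrapped_lines is hypothetical.

```python
import re


def join_wrapped_lines(text: str) -> str:
    """Merge hard-wrapped lines so that each paragraph is a single line.

    Assumption: paragraphs are separated by at least one blank line.
    """
    # Normalize Windows and old Macintosh line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split wherever a blank (or whitespace-only) line occurs.
    blocks = re.split(r"\n\s*\n", text.strip())
    # Inside a paragraph, collapse every run of whitespace, including
    # the line breaks, into a single space.
    paragraphs = [" ".join(block.split()) for block in blocks]
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    sample = "First line of\nparagraph one.\n\nSecond\nparagraph.\n"
    print(join_wrapped_lines(sample))
```

The sketch does not rejoin words hyphenated at a line end, and it would flatten text in which single line breaks are meaningful, such as verse or code listings.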

Assuming the author of the document declared the appropriate charset, there shouldn't be a problem with the codepage when you import an HTML file. But another problem may arise—the shortcuts associated with your macros stop working in HTML documents. To make macros work, you need to create an empty Writer document, open the HTML file, copy it, close the HTML file and, finally, paste the content into the Writer document.

Codepages and DOCs

Our magazine is published in Polish, so I need to use more sophisticated methods when exporting files. Specifically, I need to use fonts with Polish diacritics. My tests of StarWriter and Writer have shown that if you want to avoid problems related to codepages in non-English language documents, you should use TrueType fonts instead of Type1 fonts. Moreover, you get the best results when exporting documents to the Microsoft Word format if you use the same fonts that Microsoft Windows uses. The Microsoft fonts bundled in Microsoft FontPack, including Times New Roman, Arial and Courier New, are sufficient in most cases.

The authors of StarOffice and OpenOffice.org had to use some reverse engineering to discover how the Microsoft Word format is constructed. As a result, the export filter from Writer to Word works well but not perfectly. Therefore, if you want to exchange standard document types with other users, prepare one typical document using all the necessary formatting, including headers, italics and boldface. Then make the sample available to coworkers and ask them if everything works well.

The articles we publish are a simple kind of document. Our editorial office uses the three above-mentioned fonts, as well as italic and bold, two levels of headers and simple tables. We do not include the graphics in our documents; we simply list the names of the files in PNG or EPS format. Such documents can be exported from SDW or SXW formats to Microsoft Word without any problems.

Figure 2. An HTML file as exported by OpenOffice.org; it uses styles, classes and a lot of other unwanted formatting.

Figure 3. The same HTML file converted using the soffice2html filter—more standardized and more readable.

Figure 4. CHIP Special editorial staff, from left to right: Robert Bielecki (editor), Romek Gnitecki (editor in chief), Cezary M. Kruk (CHIP Special Linux) and Tomek Borukalo (editor).



Are the Perl scripts available?


Interesting reading - I was particularly interested in the way that 'clean' HTML can be obtained from the raw HTML output of Ooo writer. Are the Perl scripts available?


It looks like these perl scripts manipulate tags using regular expressions. There are too many edge cases for this to be a good idea. Much better would be to properly parse the HTML, tidy it and walk the DOM tree to remove unwanted elements. The HTML section at CPAN looks like a good place to start doing that. Mind you, there are also interfaces to manipulate OOo documents directly. Good luck, happy hacking!
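The parse-then-filter approach suggested in the comment above can be sketched with the html.parser module from Python's standard library, used here instead of Perl and CPAN. This is not the soffice2html filter, and it uses a streaming parser rather than a full DOM tree. The list of tags to keep is an assumption, chosen to match the simple articles described earlier, and the names HtmlCleaner and clean_html are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist: structural tags that a simple article needs.
KEEP_TAGS = {"html", "head", "title", "body", "h1", "h2", "p", "b", "i",
             "em", "strong", "pre", "br", "table", "tr", "th", "td"}
# Elements whose whole content is discarded, not only the tags.
SKIP_CONTENT = {"style", "script"}


class HtmlCleaner(HTMLParser):
    """Re-emit whitelisted tags without attributes; keep all other text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self._skip_depth += 1
        elif tag in KEEP_TAGS and not self._skip_depth:
            # Attributes such as class, style and lang are dropped.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in KEEP_TAGS and not self._skip_depth and tag != "br":
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            # The parser has already decoded entities, so escape again.
            self.out.append(escape(data, quote=False))


def clean_html(source: str) -> str:
    cleaner = HtmlCleaner()
    cleaner.feed(source)
    cleaner.close()
    return "".join(cleaner.out)


if __name__ == "__main__":
    dirty = '<p class="western"><font face="Arial">Hello <b>world</b></font></p>'
    print(clean_html(dirty))
```

Tags outside the whitelist, such as font and span, disappear while the text inside them is kept. Links and images are dropped as well, so a real filter would need to extend the whitelist and pass through selected attributes such as href.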
