Building Impress and PowerPoint Slides with LaTeX and Perl

Forced to use proprietary file formats? Let open source ease the burden.

The Filter for Extracting Textual Content

The getcontent script is the type of script that Perl programmers typically create, use and then throw away. (See the on-line Resources for downloading the files referred to in this article.) It loops on standard input, reading one line at a time, and attempts to pattern-match on content of interest. If a match occurs, appropriate output is produced. As an example of what getcontent does, here's the code for dealing with the chapter title from the LaTeX file:


if ( /\\chapter\{(.*)\}/ )
{
    print "CHAPTERTITLE: $1\n";
    next;
}

A simple regular expression attempts to match on the LaTeX chapter macro; if a match is found, the chapter title is extracted and output is generated. The call to next short-circuits the loop, allowing the next line to be read in from standard input when a match is found. In this way, the following LaTeX snippet:

\chapter{Working with Regular Expressions}

is transformed into this textual content:

CHAPTERTITLE: Working with Regular Expressions

That is, the LaTeX markup is removed and replaced with a much simpler markup. The section and subsection LaTeX macros were treated in a similar way. Here's the code:


if ( /\\section\{(.*)\}/ )
{
    print "BULLETTITLE: $1\n";
    next;
}

if ( /\\subsection\{(.*)\}/ )
{
    print "BULLETCONTENT: $1\n";
    next;
}
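
The three heading tests differ only in the macro name and the output tag, so they could also be driven from a lookup table. The following is a minimal sketch of that idea, not how getcontent itself is written; the %tag_for hash and the convert_heading subroutine are names invented for this illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical lookup table: LaTeX macro name => output tag.
my %tag_for = (
    chapter    => 'CHAPTERTITLE',
    section    => 'BULLETTITLE',
    subsection => 'BULLETCONTENT',
);

# Return the simplified line for a LaTeX heading, or undef if the
# line contains none of the macros in the table.
sub convert_heading {
    my ($line) = @_;
    if ( $line =~ /\\(chapter|section|subsection)\{(.*)\}/ ) {
        return "$tag_for{$1}: $2";
    }
    return;
}

# Small demonstration on one heading and one line of body text.
for my $line ( '\section{Let\'s Get Started!}', 'Plain text.' ) {
    my $out = convert_heading($line);
    print "$out\n" if defined $out;
}
```

Run as is, the sketch prints BULLETTITLE: Let's Get Started! and silently skips the plain-text line. Adding another heading macro would then mean adding one table entry and one alternative to the regular expression.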

Working with source code listings is only slightly more complex, due to the requirement to spot when a chunk of verbatim text has been entered and exited. Here's the code that handles entry into a LaTeX verbatim block:


if ( /\\begin\{verbatim\}/ )
{
    print "STARTCODE\n";
    $in_verbatim = TRUE;
    next;
}

And, here's the code used to handle the exit from a verbatim block:


if ( $in_verbatim )
{
    if ( /\\end\{verbatim\}/ )
    {
        print "STOPCODE\n";
        $in_verbatim = FALSE;
    }
    else
    {
        print;
    }
    next;
}

A simple boolean, the $in_verbatim scalar, helps to determine whether the script currently is working within a verbatim block. Similar code extracts the maxims that appear throughout the book's chapters, and a few if blocks handle the graphics, their captions and other content of interest. For example, consider the following chunk of LaTeX markup:


\chapter{The Basics}

\textit{Getting started with Perl.}

\section{Let's Get Started!}

There is no substitute for practical experience when first
learning how to program. So, here is the first Perl program
\index{welcome@\texttt{welcome}}, and the first program, called
\texttt{welcome}:

\begin{verbatim}
    print "Welcome to the World of Perl!\n";
\end{verbatim}

\noindent When executed by \texttt{perl}
\footnote{We will learn how to do this is in
just a moment.}, this small program displays
the following, perhaps rather not unexpected,
message on screen:

\begin{verbatim}
    Welcome to the World of Perl!
\end{verbatim}

The getcontent script transforms the above LaTeX into this textual content:


CHAPTERTITLE: The Basics
CHAPTERCONTENT: Getting started with Perl.
BULLETTITLE: Let's Get Started!
STARTCODE
    print "Welcome to the World of Perl!\n";
STOPCODE
STARTCODE
    Welcome to the World of Perl!
STOPCODE

Notice how all of the LaTeX markup is gone, replaced by a simpler markup language that will be used to produce slides programmatically. Assuming the LaTeX chunk was in a file called chapter3.tex, the getcontent script is executed as follows, redirecting the result of the transformations into an appropriately named file:

perl getcontent chapter3.tex > chapter3.input

The chapter3.input file now contains the textual content, and it can be fine-tuned with any text editor prior to producing the slides.
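
To show how the individual tests fit together, here is a cut-down sketch of the filter loop that handles only chapter titles and verbatim blocks. It is an illustration under stated assumptions rather than the real script: the extract_content subroutine is a name invented here, and it works on a list of lines instead of standard input so that it can be tried in isolation:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert a list of LaTeX lines into the simplified markup.
# Only chapter titles and verbatim blocks are handled here.
sub extract_content {
    my @lines       = @_;
    my $in_verbatim = 0;
    my @out;

    for my $line (@lines) {
        # The verbatim test comes first so that code lines which
        # happen to look like LaTeX macros are copied untouched.
        if ($in_verbatim) {
            if ( $line =~ /\\end\{verbatim\}/ ) {
                push @out, "STOPCODE\n";
                $in_verbatim = 0;
            }
            else {
                push @out, $line;
            }
            next;
        }
        if ( $line =~ /\\chapter\{(.*)\}/ ) {
            push @out, "CHAPTERTITLE: $1\n";
            next;
        }
        if ( $line =~ /\\begin\{verbatim\}/ ) {
            push @out, "STARTCODE\n";
            $in_verbatim = 1;
            next;
        }
        # Anything else (body text, footnotes and so on) is dropped.
    }
    return @out;
}

# A short sample; the quoted heredoc keeps the backslashes literal.
my @sample = split /^/, <<'END';
\chapter{The Basics}
Some body text.
\begin{verbatim}
    print "Welcome to the World of Perl!\n";
\end{verbatim}
END

print extract_content(@sample);
```

The order of the tests matters: checking the verbatim flag before anything else means that a code listing containing, say, a \chapter macro is passed through as code rather than mistaken for a chapter title.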

The Impress Presentation Creation Filter

Producing the slides within an Impress document was complicated by a number of factors. For starters, the OpenOffice::OODoc module cannot be used to create a new OpenOffice.org file; it can manipulate existing files only. Additionally, the module was created with a view to working primarily with OpenOffice.org Writer files—word processor documents—not Impress presentations. By way of example, here's a short program, called appendpara, that adds some text to an already existing Writer document:

#! /usr/bin/perl -w

use strict;

use OpenOffice::OODoc;

my $document = ooDocument( file => 'blank.sxw' );

$document->appendParagraph
(
    text    => 'Some new text',
    style   => 'Text body'
);

$document->save;

This small program uses the OpenOffice::OODoc module and creates a document object from the existing Writer file. The program then invokes the appendParagraph method to add some text before invoking the save method to commit the changed document to disk.

In addition to the appendParagraph method, the OpenOffice::OODoc module provides the insertElement method, which allows a new page of a specified type to be added to a document. The page can be a clone of an existing page or it can be actual, raw XML.

After reading as far as page 6 of the 600+ page OpenOffice.org XML file format document, I discovered that Impress used the //draw:page XML type to represent a slide within a presentation. Unfortunately, the OpenOffice::OODoc module could not work directly with objects of this type, so I had to come up with some other mechanism to manipulate the data. Specifically, I wanted to take the blank template slides contained in the blank.sxi document and clone each slide as I needed it, populating the slide's content with the textual content produced by the getcontent script. To do so, I needed to learn more about the Impress XML format.

I had two choices: continue to read the 600+ page standard document or take a look at an actual file to see if I could learn enough to get the job done. I chose the latter. Recalling from a previous Linux Journal article that OpenOffice.org compacts its multipart file using the popular ZIP algorithm, I created a temporary directory and unzipped the blank.sxi file:

mkdir unzipped
cd unzipped
unzip ../blank.sxi

This produced a bunch of files and directories:

content.xml
META-INF
meta.xml
mimetype
settings.xml
styles.xml

Of most interest is the content.xml file, which contains the actual content that makes up the document. Viewing this on screen or within an editor produced a mass of hard-to-decipher XML. To keep the parts as small as possible, the XML in every part of the zipped container is stored without any meaningful formatting; typically, it is dumped as a non-indented text stream with no extra whitespace. To make sense of it, I needed to be able to print the XML in a legible manner. In what I can describe only as a moment of temporary inspiration, I dropped to a command line and typed xml followed by two tabs. A listing of pre-installed tools that start with the letters xml appeared on screen:

xml2-config     xml-config      xmllint
xmlto           xml2man         xml-i18n-toolize
xmlproc_parse   xmlwf           xml2pot
xmlif           xmlproc_val     xmlcatalog
xmlizer         xmltex

The xmllint tool immediately caught my eye. Reading its man page uncovered the --format option, which—yes, you guessed it—pretty-prints XML provided to the tool. Therefore, typing xmllint --format content.xml resulted in output I could pipe to less and actually read without losing my sanity. Here's an abridged snippet of the pretty-printed content.xml showing the XML for the title_slide from the blank.sxi Impress document:


<draw:page draw:name="page1" draw:style- ...
  <draw:text-box presentation:style-name= ...
    <text:p text:style-name="P1">
      <text:span text:style-name="T1">
        ChapterTitleSlide
      </text:span>
    </text:p>
  </draw:text-box>
  <draw:text-box presentation:style-name= ...
    <text:p text:style-name="P3">
      <text:span text:style-name="T2">
        ChapterTitleSlideText
      </text:span>
    </text:p>
  </draw:text-box>
  <presentation:notes>
    <draw:page-thumbnail draw:style-name= ...
    <draw:text-box presentation:style-name ...
  </presentation:notes>
</draw:page>

Notice the ChapterTitleSlide and ChapterTitleSlideText content, which I had typed into blank.sxi as placeholders when creating it with Impress. If I could use the insertElement method to add raw XML based on this extract, with the placeholder content replaced by my textual content, I'd be home free.

By way of example, consider what happens once the title of the presentation and its subtitle are processed by produce_slides. The insertElement method is invoked as follows, creating a new slide:

$presentation->insertElement(
    '//draw:page',
    $last_slide++,
    title_slide( $title_title, $title_content ),
    position => 'after'
);
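
The body of title_slide is not shown above. One way such a helper could work is to take the template slide's XML as a string and swap the two placeholders for the real text. The sketch below is an assumption made for illustration, not the article's actual code: fill_title_slide and xml_escape are invented names, and the short $template string merely stands in for the full draw:page element taken from content.xml:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: return a copy of the template XML with the
# two placeholders replaced by the escaped title and content.
sub fill_title_slide {
    my ( $template, $title, $content ) = @_;
    my %value = (
        ChapterTitleSlide     => xml_escape($title),
        ChapterTitleSlideText => xml_escape($content),
    );
    # A single pass, so text inserted for one placeholder is never
    # rescanned; \b stops the shorter name matching inside the longer.
    $template =~ s/\b(ChapterTitleSlideText|ChapterTitleSlide)\b/$value{$1}/g;
    return $template;
}

# Stand-in template; the real one would be the draw:page XML.
my $template = '<text:span>ChapterTitleSlide</text:span>'
             . '<text:span>ChapterTitleSlideText</text:span>';

print fill_title_slide( $template, 'Perl & LaTeX', 'Getting started' ), "\n";
```

Escaping matters here because chapter titles taken from LaTeX can easily contain an ampersand or angle bracket, either of which would otherwise leave the inserted XML malformed.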

______________________

Comments

Great ideas, thanks!

getcontent script

Norberto's picture

I am not a Perl programmer myself. Is there a way to obtain a working getcontent script?

Best

Missing getcontent script

barryp's picture

Sorry ... the script appears to be missing from the download. Here it is:

#! /usr/bin/perl -w

#
# The "getcontent" script: Given a LaTeX file on the command-line,
# extract its textual content.
#
# By Paul Barry, paul.barry@itcarlow.ie
#

use strict;

use constant TRUE => 1;
use constant FALSE => 0;

my $in_verbatim = FALSE;
my $in_maxim = FALSE;
my $graphic_name = '';

while ( <> )
{
    if ( $in_maxim )
    {
        if ( /\\end\{maxim\}/ )
        {
            print "STOPMAXIM\n";
            $in_maxim = FALSE;
        }
        else
        {
            print;
        }
        next;
    }

    if ( $in_verbatim )
    {
        if ( /\\end\{verbatim\}/ || /\\end\{alltt\}/ )
        {
            print "STOPCODE\n";
            $in_verbatim = FALSE;
        }
        else
        {
            print;
        }
        next;
    }

    if ( /\\chapter\{(.*)\}/ )
    {
        print "CHAPTERTITLE: $1\n"; next;
    }

    if ( /\\section\{(.*)\}/ )
    {
        print "BULLETTITLE: $1\n"; next;
    }

    if ( /\\subsection\{(.*)\}/ )
    {
        print "BULLETCONTENT: $1\n"; next;
    }

    if ( /\\begin\{verbatim\}/ || /\\begin\{alltt\}/ )
    {
        print "STARTCODE\n";
        $in_verbatim = TRUE; next;
    }

    if ( /\\begin\{maxim\}/ )
    {
        print "STARTMAXIM\n";
        $in_maxim = TRUE; next;
    }

    if ( /images\/(.*?)\}/ )
    {
        $graphic_name = $1; next;
    }

    if ( /\\caption\{\\label\{/ )
    {
        /label\{.*?\}(.*)\}\}/;
        print "GRAPHICCAPTION: $1\n";
        print "GRAPHICNAME: $graphic_name\n"; next;
    }

    if ( /^\\textit\{(.*)\}/ )
    {
        print "CHAPTERCONTENT: $1\n"; next;
    }
}

Paul Barry

Some important updates to the OpenOffice::OODoc module

barryp's picture

Jean-Marie Gouarné contacted me via e-mail with some updates on the status of his excellent Perl module. Here's what he said:

Thanks for this article. It's very useful for evangelization about the OOo XML format... And (that is much less important) thanks for your test with my OpenOffice::OODoc module!

However, I've just 2 remarks about your quotation of this Perl module:

1) OpenOffice::OODoc *can* create new OOo files (texts, spreadsheets, presentations and drawings) from scratch; this feature is available since version 1.201 (2004-07-30). To do so, the ooDocument() constructor must be called with a create => $class option (where $class is the document class, i.e. "text", "spreadsheet", etc).

2) The module has notably evolved in the meantime; now it supports both the OpenOffice.org 1.0 and the OpenDocument formats; in addition, there are a few draw- or impress-focused methods (so, for example, such methods as insertDrawPage or appendDrawPage are available in order to organize and copy presentation slides). But you were right when you said that "the module was created with a view to working primarily with OpenOffice.org Writer files". Text documents were and remain the main target.

I thought it worthwhile to post his message here. Thanks.

--
Paul Barry
IT Carlow, Ireland
http://glasnost.itcarlow.ie/~barryp

Paul Barry

Writing to Impress from Perl

Michelle Chang's picture

Here is an easy way to write to Impress/PowerPoint, as Jean-Marie said:

#! /usr/bin/perl -w

use strict;
use OpenOffice::OODoc;

# start a new preso
my $preso = ooDocument(file => 'test.sxi', create => 'presentation');

my $slide = $preso->getDrawPage(0); # slide 0

$preso->createTextBox
(
    attachment => $slide,
    size       => '10cm, 2cm',
    position   => '1cm, 2cm',
    content    => 'I want to write to Impress from Perl'
);

$preso->save;

Programmatic Conversions?

Jordan's picture

Thanks for the great article! Do you know if there is a way to programmatically convert the resulting Impress document to PowerPoint? Perhaps as in $preso->export(...)?

Or would I need to use something like the Python-UNO bridge to do so?

image extraction from pdf's

girvim01's picture

Would "pdfimages" (part of the xpdf package) have sped up the final step (extraction of images from the pdf page proofs)?

using pdfimages?

barryp's picture

> sped up the final step?

maybe ... if I had known about it! :-)

Thanks - I'll check out pdfimages.

Paul.

Paul Barry
