Extract and Parse ODF Files with Python
The Open Document Format (ODF) Alliance is designed for sharing information between different word processing applications. This article highlights the basic structure of ODF files, some internals of the underlying XML files and shows how to use Python to read the contents to perform a simple search for keywords. The code also can be the basis for more-advanced operations. In the spirit of openness, we use open-source software to read the ODF files, which in this case are Python and the OpenOffice.org package.
If you are running a recent version of Linux or OS X, you already should have Python and OpenOffice.org installed on your machine. If you need the latest versions, Python is available for free from www.python.org for both the Windows and Linux platforms. The OpenOffice.org package also is available for free from www.openoffice.org. Installing OpenOffice.org on an XP desktop is relatively painless. Download the packages from their respective sites and run the installer. Once installed, simply run the application from the desktop with a click on the installed icons.
Most folks do have Microsoft Office installed. If that's the case, the solution is to use a plugin for Microsoft Word (sourceforge.net/projects/odf-converter). You can install both the OpenOffice.org and Microsoft packages on the same machine without causing any conflicts.
Please read the Bugs section on the SourceForge site for any incompatibilities before you install the plugin. I used the OpenOffice.org suite to save the files for this article, because it was easier.
Once you have confirmed that you have the prerequisites, you can create an ODF file. Open up the Writer, type some text in a document and save it. You can read in a file and save it as an .odt file.
A quick look at extensions in the Save dialog reveals a lot. An ODF file can have many extensions, which provide a clue as to the type of information stored in it and the application that stored it. See Table 1.
Table 1. ODF File Types and Their Extensions
|Document Format||File Extension|
|OpenDocument Text Template||*.ott|
|OpenDocument Master Document||*.odm|
|HTML Document Template||*.oth|
|OpenDocument Spreadsheet Template||*.ots|
|OpenDocument Drawing Template||*.otg|
|OpenDocument Presentation Template||*.otp|
So, what's in an ODF file? An ODF file is basically a zipped archive with several XML files. The actual files and directories in a file will vary depending on the type of information and the system on which the document was created.
The first step in picking out the names of the files in an ODF file requires unzipping the file itself. Fortunately, Python has built-in support for dealing with this endeavor with the zipfile module. Type python on the command line to run an interactive shell. Running a shell allows you to examine the contents of objects returned from the modules. Because you'll probably be doing this only once per type of data, there is really no need to write and execute a script at this time. If you want to preserve the work for future use, it's better to write a script in a text editor or use the IDLE editor that comes with Python and save the script. See Listing 1 on how to show the member functions in a class or module.
Listing 1. Showing the Member Functions in a Class or Module
Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)] on win32 Type "copyright", "credits" or "license()" for more information. >>> import zipfile >>> myfile = zipfile.ZipFile ↪('E:/articles/odf/theArticle.odt') >>> dir(myfile) ['NameToInfo', '_GetContents', '_RealGetContents', '__del__', '__doc__', '__init__', '__module__', '_filePassed', '_writecheck', 'close', 'comment', 'compression', 'debug', 'filelist', 'filename', 'fp', 'getinfo', 'infolist', 'mode', 'namelist', 'printdir', 'read', 'start_dir', 'testzip', 'write', 'writestr'] >>> >>> >>> listoffiles = myfile.infolist() >>> dir(listoffiles) ['CRC', 'FileHeader', '__doc__', '__init__', '__module__', 'comment', 'compress_size', 'compress_type', 'create_system', 'create_version', 'date_time', 'external_attr', 'extra', 'extract_version', 'file_offset', 'file_size', 'filename', 'flag_bits', 'header_offset', 'internal_attr', 'orig_filename', 'reserved', 'volume'] >>>
The infolist() command from the Python documentation returns a list the objects of a zipped archive. Run the dir() command on the first object from this list to get more information stored for each zipped file (Listing 2). This nice feature of looking at members in an object is called introspection.
An iteration on the list of objects displays relevant information about each file, such as when it was created, its extracted size, its compressed size and so on. It's better at this point to write a Python script and run it rather than work at the command line of an interactive Python shell. This way, you can preserve the script for later use. So, open up a text editor and type in the script.
Getting Started with DevOps - Including New Data on IT Performance from Puppet Labs 2015 State of DevOps Report
August 27, 2015
12:00 PM CDT
DevOps represents a profound change from the way most IT departments have traditionally worked: from siloed teams and high-anxiety releases to everyone collaborating on uneventful and more frequent releases of higher-quality code. It doesn't matter how large or small an organization is, or even whether it's historically slow moving or risk averse — there are ways to adopt DevOps sanely, and get measurable results in just weeks.
Free to Linux Journal readers.Register Now!
|Secure Server Deployments in Hostile Territory, Part II||Jul 29, 2015|
|Hacking a Safe with Bash||Jul 28, 2015|
|KDE Reveals Plasma Mobile||Jul 28, 2015|
|Huge Package Overhaul for Debian and Ubuntu||Jul 23, 2015|
|diff -u: What's New in Kernel Development||Jul 22, 2015|
|Shashlik - a Tasty New Android Simulator||Jul 21, 2015|
- Secure Server Deployments in Hostile Territory, Part II
- Hacking a Safe with Bash
- KDE Reveals Plasma Mobile
- Huge Package Overhaul for Debian and Ubuntu
- Home Automation with Raspberry Pi
- The Controversy Behind Canonical's Intellectual Property Policy
- Shashlik - a Tasty New Android Simulator
- Embed Linux in Monitoring and Control Systems
- diff -u: What's New in Kernel Development
- General Relativity in Python