Extract and Parse ODF Files with Python
The Open Document Format (ODF) Alliance is designed for sharing information between different word processing applications. This article highlights the basic structure of ODF files, some internals of the underlying XML files and shows how to use Python to read the contents to perform a simple search for keywords. The code also can be the basis for more-advanced operations. In the spirit of openness, we use open-source software to read the ODF files, which in this case are Python and the OpenOffice.org package.
If you are running a recent version of Linux or OS X, you already should have Python and OpenOffice.org installed on your machine. If you need the latest versions, Python is available for free from www.python.org for both the Windows and Linux platforms. The OpenOffice.org package also is available for free from www.openoffice.org. Installing OpenOffice.org on an XP desktop is relatively painless. Download the packages from their respective sites and run the installer. Once installed, simply run the application from the desktop with a click on the installed icons.
Most folks do have Microsoft Office installed. If that's the case, the solution is to use a plugin for Microsoft Word (sourceforge.net/projects/odf-converter). You can install both the OpenOffice.org and Microsoft packages on the same machine without causing any conflicts.
Please read the Bugs section on the SourceForge site for any incompatibilities before you install the plugin. I used the OpenOffice.org suite to save the files for this article, because it was easier.
Once you have confirmed that you have the prerequisites, you can create an ODF file. Open up the Writer, type some text in a document and save it. You can read in a file and save it as an .odt file.
A quick look at extensions in the Save dialog reveals a lot. An ODF file can have many extensions, which provide a clue as to the type of information stored in it and the application that stored it. See Table 1.
Table 1. ODF File Types and Their Extensions
|Document Format||File Extension|
|OpenDocument Text Template||*.ott|
|OpenDocument Master Document||*.odm|
|HTML Document Template||*.oth|
|OpenDocument Spreadsheet Template||*.ots|
|OpenDocument Drawing Template||*.otg|
|OpenDocument Presentation Template||*.otp|
So, what's in an ODF file? An ODF file is basically a zipped archive with several XML files. The actual files and directories in a file will vary depending on the type of information and the system on which the document was created.
The first step in picking out the names of the files in an ODF file requires unzipping the file itself. Fortunately, Python has built-in support for dealing with this endeavor with the zipfile module. Type python on the command line to run an interactive shell. Running a shell allows you to examine the contents of objects returned from the modules. Because you'll probably be doing this only once per type of data, there is really no need to write and execute a script at this time. If you want to preserve the work for future use, it's better to write a script in a text editor or use the IDLE editor that comes with Python and save the script. See Listing 1 on how to show the member functions in a class or module.
Listing 1. Showing the Member Functions in a Class or Module
Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)] on win32 Type "copyright", "credits" or "license()" for more information. >>> import zipfile >>> myfile = zipfile.ZipFile ↪('E:/articles/odf/theArticle.odt') >>> dir(myfile) ['NameToInfo', '_GetContents', '_RealGetContents', '__del__', '__doc__', '__init__', '__module__', '_filePassed', '_writecheck', 'close', 'comment', 'compression', 'debug', 'filelist', 'filename', 'fp', 'getinfo', 'infolist', 'mode', 'namelist', 'printdir', 'read', 'start_dir', 'testzip', 'write', 'writestr'] >>> >>> >>> listoffiles = myfile.infolist() >>> dir(listoffiles) ['CRC', 'FileHeader', '__doc__', '__init__', '__module__', 'comment', 'compress_size', 'compress_type', 'create_system', 'create_version', 'date_time', 'external_attr', 'extra', 'extract_version', 'file_offset', 'file_size', 'filename', 'flag_bits', 'header_offset', 'internal_attr', 'orig_filename', 'reserved', 'volume'] >>>
The infolist() command from the Python documentation returns a list the objects of a zipped archive. Run the dir() command on the first object from this list to get more information stored for each zipped file (Listing 2). This nice feature of looking at members in an object is called introspection.
An iteration on the list of objects displays relevant information about each file, such as when it was created, its extracted size, its compressed size and so on. It's better at this point to write a Python script and run it rather than work at the command line of an interactive Python shell. This way, you can preserve the script for later use. So, open up a text editor and type in the script.
|Happy Birthday Linux||Aug 25, 2016|
|ContainerCon Vendors Offer Flexible Solutions for Managing All Your New Micro-VMs||Aug 24, 2016|
|Updates from LinuxCon and ContainerCon, Toronto, August 2016||Aug 23, 2016|
|NVMe over Fabrics Support Coming to the Linux 4.8 Kernel||Aug 22, 2016|
|What I Wish I’d Known When I Was an Embedded Linux Newbie||Aug 18, 2016|
|Pandas||Aug 17, 2016|
- Happy Birthday Linux
- ContainerCon Vendors Offer Flexible Solutions for Managing All Your New Micro-VMs
- Updates from LinuxCon and ContainerCon, Toronto, August 2016
- New Version of GParted
- Puppet and Nagios: a Roadmap to Advanced Configuration
- Blender for Visual Effects
- What I Wish I’d Known When I Was an Embedded Linux Newbie
- NVMe over Fabrics Support Coming to the Linux 4.8 Kernel
- A New Project for Linux at 25
- Tor 0.2.8.6 Is Released
With all the industry talk about the benefits of Linux on Power and all the performance advantages offered by its open architecture, you may be considering a move in that direction. If you are thinking about analytics, big data and cloud computing, you would be right to evaluate Power. The idea of using commodity x86 hardware and replacing it every three years is an outdated cost model. It doesn’t consider the total cost of ownership, and it doesn’t consider the advantage of real processing power, high-availability and multithreading like a demon.
This ebook takes a look at some of the practical applications of the Linux on Power platform and ways you might bring all the performance power of this open architecture to bear for your organization. There are no smoke and mirrors here—just hard, cold, empirical evidence provided by independent sources. I also consider some innovative ways Linux on Power will be used in the future.Get the Guide