Extract and Parse ODF Files with Python
The Open Document Format (ODF) is designed for sharing information between different word processing applications. This article highlights the basic structure of ODF files and some internals of the underlying XML files, and it shows how to use Python to read the contents and perform a simple search for keywords. The code also can be the basis for more-advanced operations. In the spirit of openness, we use open-source software to read the ODF files, in this case Python and the OpenOffice.org package.
If you are running a recent version of Linux or OS X, you already should have Python and OpenOffice.org installed on your machine. If you need the latest versions, Python is available for free from www.python.org for both the Windows and Linux platforms. The OpenOffice.org package also is available for free from www.openoffice.org. Installing OpenOffice.org on an XP desktop is relatively painless. Download the packages from their respective sites and run the installer. Once installed, simply run the application from the desktop with a click on the installed icons.
Many folks have Microsoft Office installed instead. If that's your case, one solution is to use a plugin for Microsoft Word (sourceforge.net/projects/odf-converter). You also can install both the OpenOffice.org and Microsoft packages on the same machine without causing any conflicts.
Please read the Bugs section on the SourceForge site for any incompatibilities before you install the plugin. I used the OpenOffice.org suite to save the files for this article, because it was easier.
Once you have confirmed that you have the prerequisites, you can create an ODF file. Open Writer, type some text in a document and save it. Alternatively, open an existing document and save it as an .odt file.
A quick look at extensions in the Save dialog reveals a lot. An ODF file can have many extensions, which provide a clue as to the type of information stored in it and the application that stored it. See Table 1.
Table 1. ODF File Types and Their Extensions
|Document Format|File Extension|
|OpenDocument Text Template|*.ott|
|OpenDocument Master Document|*.odm|
|HTML Document Template|*.oth|
|OpenDocument Spreadsheet Template|*.ots|
|OpenDocument Drawing Template|*.otg|
|OpenDocument Presentation Template|*.otp|
So, what's in an ODF file? An ODF file is basically a zipped archive with several XML files. The actual files and directories in a file will vary depending on the type of information and the system on which the document was created.
The first step in picking out the names of the files in an ODF file is to unzip the file itself. Fortunately, Python has built-in support for this with the zipfile module. Type python on the command line to run an interactive shell. Running a shell allows you to examine the contents of objects returned from the modules. Because you'll probably be doing this only once per type of data, there is really no need to write and execute a script at this time. If you want to preserve the work for future use, it's better to write a script in a text editor, or use the IDLE editor that comes with Python, and save the script. Listing 1 shows how to display the member functions in a class or module.
Listing 1. Showing the Member Functions in a Class or Module
Python 2.4.1 (#65, Mar 30 2005, 09:13:57) [MSC v.1310 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import zipfile
>>> myfile = zipfile.ZipFile('E:/articles/odf/theArticle.odt')
>>> dir(myfile)
['NameToInfo', '_GetContents', '_RealGetContents', '__del__',
 '__doc__', '__init__', '__module__', '_filePassed', '_writecheck',
 'close', 'comment', 'compression', 'debug', 'filelist', 'filename',
 'fp', 'getinfo', 'infolist', 'mode', 'namelist', 'printdir', 'read',
 'start_dir', 'testzip', 'write', 'writestr']
>>>
>>> listoffiles = myfile.infolist()
>>> dir(listoffiles[0])
['CRC', 'FileHeader', '__doc__', '__init__', '__module__', 'comment',
 'compress_size', 'compress_type', 'create_system', 'create_version',
 'date_time', 'external_attr', 'extra', 'extract_version',
 'file_offset', 'file_size', 'filename', 'flag_bits', 'header_offset',
 'internal_attr', 'orig_filename', 'reserved', 'volume']
>>>
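Among the members that dir() reports for the archive object is namelist(), which returns the names of the files stored inside. The following is a minimal sketch of that call, written for a current Python 3 interpreter rather than the 2.4 release shown in the listing. So that it runs without a real document, it builds a tiny stand-in archive in memory; the helper names make_sample_odf and list_member_names and the placeholder XML are assumptions made for illustration, not part of any ODF tool.

```python
import io
import zipfile


def make_sample_odf():
    """Build a tiny stand-in for an .odt file in memory (placeholder content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        archive.writestr("content.xml", "<doc>Hello ODF</doc>")  # not real ODF markup
        archive.writestr("meta.xml", "<meta/>")                  # not real ODF markup
    buffer.seek(0)
    return buffer


def list_member_names(source):
    """Return the names of the files stored in a zipped archive.

    source may be a path to a file on disk or an open file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


if __name__ == "__main__":
    for name in list_member_names(make_sample_odf()):
        print(name)
```

To try it on a document you saved from Writer, pass the path of the .odt file to list_member_names() instead of the in-memory sample.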
The infolist() method, described in the Python documentation, returns a list of objects, one for each file in the zipped archive. Run the dir() command on the first object from this list to see what information is stored for each zipped file (Listing 2). This ability to look at the members of an object at runtime is called introspection.
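Introspection works the same way from a script. The sketch below, again assuming Python 3 and an in-memory stand-in archive rather than a real document, filters the output of dir() down to the names that do not start with an underscore and then reads two of the reported attributes with ordinary attribute access. The helper name public_members is hypothetical.

```python
import io
import zipfile


def public_members(obj):
    """Return the names dir() reports for obj, minus those starting with '_'."""
    return [name for name in dir(obj) if not name.startswith("_")]


# Stand-in archive built in memory; the XML is placeholder text, not real ODF.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", "<doc>placeholder</doc>")
buffer.seek(0)

# infolist() returns one object per stored file; inspect the first one.
with zipfile.ZipFile(buffer) as archive:
    first = archive.infolist()[0]

members = public_members(first)
print(members)
print(first.filename, first.file_size)
```

The exact list printed depends on the Python version, so expect it to differ somewhat from the 2.4 output in the listing.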
Iterating over the list of objects displays relevant information about each file, such as its time stamp, its extracted size, its compressed size and so on. It's better at this point to write a Python script and run it rather than work at the command line of an interactive Python shell. This way, you can preserve the script for later use. So, open up a text editor and type in the script.
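As a rough sketch of such a script, assuming Python 3, the loop below walks the objects returned by infolist() and formats the filename, date_time, file_size and compress_size attributes of each one. The function name describe_members, the sample archive and its fixed time stamp are assumptions made so the example runs on its own; they are not taken from a real document.

```python
import io
import zipfile


def describe_members(source):
    """Return one summary line per file stored in the archive."""
    lines = []
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            # date_time is a (year, month, day, hour, minute, second) tuple
            stamp = "%04d-%02d-%02d %02d:%02d:%02d" % info.date_time
            lines.append("%-20s %s  size=%d  compressed=%d" % (
                info.filename, stamp, info.file_size, info.compress_size))
    return lines


# Stand-in archive with a fixed, made-up time stamp and placeholder XML.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    entry = zipfile.ZipInfo("content.xml", date_time=(2007, 1, 15, 9, 30, 0))
    archive.writestr(entry, "<doc>placeholder</doc>")
sample.seek(0)

lines = describe_members(sample)
for line in lines:
    print(line)
```

Replace the sample object with the path to a saved .odt file to see the same summary for a real document.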