Extract and Parse ODF Files with Python

Use Python to demystify Open Document Format files.

A SAX program requires a class derived from ContentHandler that overrides the functions handling the start, contents and end of each node. The tagHandler class shown in Listing 4 is used primarily to track the start of each node and ignores the contents. Use the names found in the listing as keys in a dictionary. Then, use the keys() method to collect the names into a list that you also can sort(). I use this technique quite often to get a sorted list of unique members quickly, because the hashing functions in Python dictionaries are very fast. Here's the output from the program:

office:automatic-styles
office:body
office:document-content
office:font-face-decls
office:forms
office:scripts
office:text
style:font-face
style:list-level-properties
style:paragraph-properties
style:style
style:text-properties
text:a
text:line-break
text:list
text:list-item
text:list-level-style-bullet
text:list-style
text:p
text:s
text:sequence-decl
text:sequence-decls
text:span

I sorted the list of keys and printed only the types of tags found. A real document has many tags, and the sorted list does not reflect the order in which they appear in the XML file. The tag you most likely will be concerned with is <text:p>, which holds the text of each paragraph. In this example, I want to search for keywords in the text of one or more paragraphs in a document.
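The tag-listing technique described above can be sketched roughly as follows. This is not the article's Listing 4; the class name TagCollector, the helper list_tags() and the sample XML are all made up for illustration, and the code is written for Python 3:

```python
import xml.sax


class TagCollector(xml.sax.ContentHandler):
    """Record the name of every element that starts; ignore text content."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.seen = {}  # tag name -> number of times the tag was opened

    def startElement(self, name, attrs):
        # Dictionary keys are unique, so repeated tags collapse to one entry.
        self.seen[name] = self.seen.get(name, 0) + 1


def list_tags(xml_bytes):
    """Return the sorted, unique element names found in an XML document."""
    handler = TagCollector()
    xml.sax.parseString(xml_bytes, handler)
    return sorted(handler.seen.keys())


# A tiny stand-in for content.xml (hypothetical sample, not a real ODF file).
SAX_SAMPLE = b"""<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body><office:text>
    <text:p>First paragraph</text:p>
    <text:p>Second <text:span>paragraph</text:span></text:p>
  </office:text></office:body>
</office:document-content>"""

for tag in list_tags(SAX_SAMPLE):
    print(tag)
```

The dictionary here also counts how often each tag occurs, which is an addition of this sketch; only the keys are needed to produce a listing like the one above.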

Although I can use the SAX version of the program to search for the text, I use the DOM libraries, because the code will be a little easier to write (and subsequently, easier to maintain), and I promised an example of this earlier.

The xml.dom and xml.dom.minidom packages in Python allow for reading in XML files as DOM trees. Both packages come with a standard Python installation. Use the minidom package as it has a smaller footprint and is easier to use than the dom package. The minidom package is sufficient for almost all my XML work, and I have never really needed the heavyweight xml.dom package. See Resources for more information.

Use the minidom package to read the elements of an XML file into a tree-like structure. The nodes of the tree are objects based on the Node class in Python. The output from Listing 4 provides the names of the nodes you can look for in that tree.
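As a rough illustration of that tree structure, the sketch below parses a small XML string with minidom and walks the resulting nodes. The sample XML and the walk() helper are assumptions made for this example, and the code targets Python 3:

```python
from xml.dom import minidom

# Hypothetical fragment in the style of content.xml.
DOM_SAMPLE = (
    '<office:text'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    '<text:p>Hello</text:p><text:p>World</text:p>'
    '</office:text>'
)


def walk(node, depth=0, out=None):
    """Collect (depth, nodeName) for every element node, depth first."""
    if out is None:
        out = []
    if node.nodeType == node.ELEMENT_NODE:
        out.append((depth, node.nodeName))
    for child in node.childNodes:
        walk(child, depth + 1, out)
    return out


doc = minidom.parseString(DOM_SAMPLE)
for depth, name in walk(doc.documentElement):
    print("  " * depth + name)
```

Text inside an element shows up as a separate child node of type TEXT_NODE, which is why walk() checks nodeType before recording a name.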

Up to this point, you have been using simple programs to list and extract data from the archive. Now, the next example runs a search operation on the file you've just read in. Look at the program in Listing 5.

The program is designed to work as a class that reads and searches for text in an ODF file. Declaring a class for the ODF reader helps in organizing the code for searching text within a node. The showManifest() member function simply tells me what files exist in the ODF file. In this particular program, I collect all the text as a list of paragraphs, and then I search for the keywords passed in from the command line. If the searched word matches, the paragraph is printed out.
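A class along the lines described above might look like the following sketch. It is not the article's Listing 5: it borrows the names showManifest, findIt and text_in_paras from the text, but the class name OdfReader, the helper make_sample_odt() and the sample document are hypothetical, and findIt() here returns the matches instead of printing them. The code assumes Python 3:

```python
import io
import zipfile
from xml.dom import minidom


class OdfReader(object):
    """Read the paragraphs of an ODF text document and search them."""

    def __init__(self, file_or_path):
        self.m_odf = zipfile.ZipFile(file_or_path)
        doc = minidom.parseString(self.m_odf.read("content.xml"))
        self.text_in_paras = [
            self._text_of(p) for p in doc.getElementsByTagName("text:p")
        ]

    def _text_of(self, node):
        """Join all text nodes beneath a node, including nested spans."""
        if node.nodeType == node.TEXT_NODE:
            return node.data
        return "".join(self._text_of(child) for child in node.childNodes)

    def showManifest(self):
        """Return the names of the files stored in the archive."""
        return self.m_odf.namelist()

    def findIt(self, name):
        """Return every paragraph that contains the given keyword."""
        return [s for s in self.text_in_paras if name in s]


def make_sample_odt():
    """Build a minimal in-memory stand-in for an .odt file (illustration only)."""
    content = (
        '<office:document-content'
        ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
        ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
        '<office:body><office:text>'
        '<text:p>The quick brown fox</text:p>'
        '<text:p>jumps over the <text:span>lazy</text:span> dog</text:p>'
        '<text:p>Nothing to see here</text:p>'
        '</office:text></office:body>'
        '</office:document-content>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        archive.writestr("content.xml", content)
    buf.seek(0)
    return buf


reader = OdfReader(make_sample_odt())
print(reader.showManifest())
for paragraph in reader.findIt("lazy"):
    print(paragraph)
```

To run this against a real document, pass a file name instead of the in-memory buffer, for example OdfReader("bill_of_rights.odt").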

The text found in each <text:p> is Unicode text. You have to convert this to normal text in order to print correctly and/or use in a widget. The encode() command translates the Unicode to a printable string.

Unicode provides a unique number for every character, regardless of the platform, program and language being used. The ability to work seamlessly with the same text across multiple platforms is a great feature for Unicode-enabled applications. Such features do come with a price for some legacy operations. In Python 2, a Unicode string is a sequence of code points rather than bytes, so a normal print statement can fail when the text contains non-ASCII characters that the output encoding cannot represent. The encode() member function of a Unicode string returns a byte string in the encoding you name, which can then be printed. Here is an example code snippet for that:

def findIt(self,name):
    for s in self.text_in_paras:
        if name in s:
            print s.encode('utf-8')
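The snippet above uses the Python 2 print statement. As a small aside, the sketch below shows what encode() does to a string containing a non-ASCII character; the sample string is made up, and the code assumes Python 3, where every str is already Unicode and encode() returns a bytes object:

```python
text = u"caf\u00e9"             # four characters, the last one is non-ASCII
encoded = text.encode("utf-8")  # five bytes: the accented e takes two bytes

print(len(text), len(encoded))
print(encoded.decode("utf-8") == text)  # decoding restores the original text
```

In Python 3, print() accepts the Unicode string directly, so the explicit encode() step is mainly needed when writing bytes to a file or socket.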

The code in Listing 5 is not limited to ODT files. You can modify the code presented here to work with spreadsheet files that have an .ods extension. Run the program in Listing 3 to get the content.xml file out, and then run the second program (shown in Listing 4) to list the types of nodes. Below is the output for a sample .ods file; note that this file also has paragraphs in addition to the table tags:

office:automatic-styles
office:body
office:document-content
office:font-face-decls
office:scripts
office:spreadsheet
style:font-face
style:style
style:table-column-properties
style:table-properties
style:table-row-properties
table:table
table:table-cell
table:table-column
table:table-row
text:p

Use the program in Listing 5 to extract and search text from paragraphs as before. Changing text:p to table:table-cell in the code searches for text within cells instead of paragraphs.
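A minimal sketch of that cell-based variant follows. The function cell_texts() and the sample spreadsheet XML are assumptions made for this example, not part of the article's listings, and the code targets Python 3:

```python
from xml.dom import minidom

# Hypothetical fragment in the style of an .ods content.xml.
ODS_SAMPLE = (
    '<table:table'
    ' xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"'
    ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    '<table:table-row>'
    '<table:table-cell><text:p>apple</text:p></table:table-cell>'
    '<table:table-cell><text:p>3</text:p></table:table-cell>'
    '</table:table-row>'
    '<table:table-row>'
    '<table:table-cell><text:p>pear</text:p></table:table-cell>'
    '<table:table-cell/>'
    '</table:table-row>'
    '</table:table>'
)


def cell_texts(xml_text, tag="table:table-cell"):
    """Return the text of every cell, one string per cell, in document order."""
    doc = minidom.parseString(xml_text)
    result = []
    for cell in doc.getElementsByTagName(tag):
        # A cell keeps its text in text:p children; an empty cell has none.
        parts = [
            node.data
            for para in cell.getElementsByTagName("text:p")
            for node in para.childNodes
            if node.nodeType == node.TEXT_NODE
        ]
        result.append("".join(parts))
    return result


cells = cell_texts(ODS_SAMPLE)
print(cells)
print([c for c in cells if "pear" in c])
```

Because the cell text still lives inside text:p elements, searching by table:table-cell mainly changes the unit that is reported (a cell instead of a paragraph).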

To summarize, an ODF file is a zipped archive of several XML files. One of these files, content.xml, holds the document contents in known tags. Each type of ODF file can have different tags, depending on the information stored. By using introspection and the XML parsing capabilities in Python, you can list the types of nodes in a file and read them into a tree structure. Once read, you can extract data only from those nodes in the tree that are of interest to you.

______________________

Comments


File is not a zip file

hypertex's picture

I am trying to run listing 5. Am I the only one receiving this error?


File "./odt.py", line 13, in __init__
self.m_odf = zipfile.ZipFile(filename)
File "/usr/lib/python2.6/zipfile.py", line 693, in __init__
self._GetContents()
File "/usr/lib/python2.6/zipfile.py", line 713, in _GetContents
self._RealGetContents()
File "/usr/lib/python2.6/zipfile.py", line 725, in _RealGetContents
raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file

and yet . . .

>>> odfname = ('bill_of_rights.odt')
>>> zipfile.is_zipfile(odfname)
True

I'm confused: is_zipfile() returns True, but I get a BadZipfile exception. Same result with several different files, so I don't think it is a bad file. Running Python 2.6.4 on Fedora 13.

D'OH!

hypertex's picture

the __main__ block should have:

filename = sys.argv[1]
phrase = sys.argv[2]

I keep forgetting that sys.argv[0] is a reference to the script, not the first argument to the script.

LPOD Project

Anonymous's picture

lpod is an ODF library that allows you to easily manipulate ODF documents.

Fail? What fail?

Anonymous's picture

To the first commenter, win32 doesn't mean windows 3.2. It means 32-bit windows.
(Windows NT/2000/XP/Vista/7)

FAIL!

Anonymous's picture

Python 2.4.1 (#65, Mar 30 2005, 09:13:57)
[MSC v.1310 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()"
for more information.

however, very good tutorial doh! ;)

greets
Royma

Nice

Nichoals's picture

Nice tutorial. You're the only one on the net who writes clearly about ODT and Python.
Do you have a tutorial on picking pictures out of the ODT file too?


Plug-in For MS Office

W^L+'s picture

Instead of the ODF-Converter project on SourceForge, with its well-known flaws (cannot set ODF formats as default formats for MS Office, poor fidelity between original file and the imported/exported version, ODF placed in separate submenu instead of the usual file->save as, and intermediate saves still require long export process), I'd recommend Sun's plugin for MS Office. It features much better integration with MS Office and better fidelity import/export.

Either one will trigger warnings about the format not being fully compatible, but for most purposes, ODF is fully capable of representing MS Office data.
