Convert SpreadSheets to CSV files with Python and pyuno, Part 1v2
Some months back I developed some pyuno code for converting spreadsheets into CSV files from the command line. Pyuno being the Python interface to the OpenOffice runtime. One of the enhancement suggestions I got was to add the ability to extract all the sheets and/or a specific sheet rather than always extracting the first sheet. The following update to the code does just that.
By way of refreshing our memory, the converter program takes pairs of input and output files, for example:
$ python ssconverter.py file1.xls file1.csv file2.ods file2.csv
Each input file is a spreadsheet and it is converted into the corresponding output file as a CSV file. The new version has been enhanced so that you can append a sheet name or number to the end of the input file name to specify a particular sheet to extract, for example:
$ python ssconverter.py file1.xls:1 file1.csv
$ python ssconverter.py file1.xls:Sheet1 file1.csv
$ python ssconverter.py file2.ods@1 file2.csv
$ python ssconverter.py file2.ods@Sheet2 file2.csv
The sheet name or number (starting at 1) is appended to the input file name separated by a colon (:) or an at sign (@).
In addition, you can convert all the sheets in one pass by specifing a %d or a %s format specifier in the output file name:
$ python ssconverter.py file1.xls file1-%d.csv
$ python ssconverter.py file1.xls file1-%s.csv
If the format specifier is %d, then the sheet number is formatted into the output file name, if %s is specified then the sheet name is used. The %d specifier can include zero pad and width specifiers (eg %04d).
The new code has three main changes over the the original code. The first change is to check for an input file sheet specification:
# Check for sheet specification in input file name.
match = re.search(r'^(.*)[@:](.*)$', inputFile)
if os.path.exists(inputFile) or not match:
inputUrl = uno.systemPathToFileUrl(os.path.abspath(inputFile))
inputSheet = '1' # Convert fist sheet.
else:
inputUrl = uno.systemPathToFileUrl(os.path.abspath(match.group(1)))
inputSheet = match.group(2)
The regular expression search checks to see if the input file name can be broken into a file name part and a sheet name part. If it can, and only if the original unbroken input file name does not refer to an existing file, the pieces are used as the input file and sheet specification. Checking to see if the original unbroken input file name refers to an existing file allows you to specify input files that have colons and at signs in their names, and it continues to work even in the case where you have an input file with a colon or at sign in its name and you want to extract a particular sheet, for example:
$ python ssconverter.py this:month.xls:sales output.csv
will correctly recognize that the sheet name is "sales" and that the file name is "this:month.xls". Since the regular expression search is be default greedy the break will occur at the last colon and since "this:month.xls:sales" does not refer to an existing file the match will be used.
The second main change, is related to how the document is loaded. After getting the rest of the code to the point where I thought it ought to be working I discovered while testing that OpenOffice was always converting the first sheet. After some flailing at the code failed to fix the problem I started looking at a few other examples of pyuno code and the only thing I really noted was that many did not use the Hidden property when loading the document. So, I gave that a shot and sure enough it fixed the problem. The accompanying comment has a few more notes:
# Sheet activation does not work properly when Hidden is specified.
# Although the sheet does become the active sheet, it's not the sheet that
# gets saved if the spreadsheet is loaded with Hidden=True.
#
# Removing Hidden=True doesn't seem to change anything: nothing appears
# on the screen regardless of the Hidden value.
#
# document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, ooutils.oo_properties(Hidden=True))
document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, ooutils.oo_properties())
The third change is the one where the sheet or sheets are extracted. If a particular sheet is specified the sheet is activated first then saved to the output file:
# Activate the sheet to be converted.
if re.search(r'^\d+$', inputSheet):
sheet = sheets.getByIndex(int(inputSheet)-1)
else:
sheet = sheets.getByName(inputSheet)
controller.setActiveSheet(sheet)
outputUrl = uno.systemPathToFileUrl(os.path.abspath(outputFile))
document.storeToURL(outputUrl, props)
The regular expression here checks to see if the sheet name is a number or a string (sheet number or sheet name).
If all the sheets are being saved then the sheets are activated one by one, the output file name is formatted and the sheet is saved:
# Use the sheet number if the format is %d, otherwise the sheet name.
dfmt = re.search(r'%[0-9]*d', outputFile)
sfmt = re.search(r'%s', outputFile)
if dfmt or sfmt:
i = 0
while i < sheets.getCount():
# Activate the sheet.
sheet = sheets.getByIndex(i)
controller.setActiveSheet(sheet)
# Create output file name.
if dfmt:
ofile = outputFile % (i+1)
else:
ofile = outputFile % sheet.getName().replace(' ', '_')
if verbose: print " %s" % ofile
# Save the sheet to the output file.
outputUrl = uno.systemPathToFileUrl(os.path.abspath(ofile))
document.storeToURL(outputUrl, props)
i += 1
The code in its entirety appears below:
1 #!/usr/bin/python
2 #
3 # Convert spreadsheet to CSV file.
4 #
5 # Based on:
6 # PyODConverter (Python OpenDocument Converter) v1.0.0 - 2008-05-05
7 # Copyright (C) 2008 Mirko Nasato <mirko@artofsolving.com>
8 # Licensed under the GNU LGPL v2.1 - or any later version.
9 # http://www.gnu.org/licenses/lgpl-2.1.html
10 #
11
12 import os
13 import re
14 import ooutils
15
16 import uno
17 from com.sun.star.task import ErrorCodeIOException
18
19
20
21 class SSConverter:
22 """
23 Spreadsheet converter class.
24 Converts spreadsheets to CSV files.
25 """
26
27 def __init__(self, oorunner=None):
28 self.desktop = None
29 self.oorunner = None
30
31
32 def convert(self, inputFile, outputFile, verbose=False):
33 """
34 Convert the input file (a spreadsheet) to a CSV file.
35
36 The input file name can contain a sheet specification to specify a particular sheet.
37 The sheet specification is either a number or a sheet name.
38 The sheet specification is appended to the file name separated by a colon
39 or an at sign: ":" or "@".
40
41 If the output file name contains a %d or %s format specifier, then all the sheets
42 in the input file are converted, otherwise only the first sheet is converted.
43
44 If the output file name contains a %d format specifier then the sheet number
45 is used when formatting the output file name.
46 The format can contain a width specifier (eg %02d).
47
48 If the output file name contains a %s specifier then the sheet name is used
49 when formatting the output file name.
50 """
51
52 # Start openoffice if needed.
53 if not self.desktop:
54 if not self.oorunner:
55 self.oorunner = ooutils.OORunner()
56
57 self.desktop = self.oorunner.connect()
58
59 # Check for sheet specification in input file name.
60 match = re.search(r'^(.*)[@:](.*)$', inputFile)
61 if os.path.exists(inputFile) or not match:
62 inputUrl = uno.systemPathToFileUrl(os.path.abspath(inputFile))
63 inputSheet = '1' # Convert fist sheet.
64 else:
65 inputUrl = uno.systemPathToFileUrl(os.path.abspath(match.group(1)))
66 inputSheet = match.group(2)
67
68
69 # NOTE:
70 # Sheet activation does not work properly when Hidden is specified.
71 # Although the sheet does become the active sheet, it's not the sheet that
72 # gets saved if the spreadsheet is loaded with Hidden=True.
73 #
74 # Removing Hidden=True doesn't seem to change anything: nothing appears
75 # on the screen regardless of the Hidden value.
76 #
77 # document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, ooutils.oo_properties(Hidden=True))
78 document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, ooutils.oo_properties())
79
80 try:
81 props = ooutils.oo_properties(FilterName="Text - txt - csv (StarCalc)")
82 #
83 # Another useful property option:
84 # FilterOptions="59,34,0,1"
85 # 59 - Field separator (semicolon), this is the ascii value.
86 # 34 - Text delimiter (double quote), this is the ascii value.
87 # 0 - Character set (system).
88 # 1 - First line number to export.
89 #
90 # For more information see:
91 # http://wiki.services.openoffice.org/wiki/Documentation/DevGuide/Spreadsheets/Filter_Options
92
93 # To convert a particular sheet, the sheet needs to be active.
94 # To activate a sheet we need the spreadsheet-view, to get the spreadsheet-view
95 # we need the spreadsheet-controller, to get the spreadsheet-controller
96 # we need the spreadsheet-model.
97 #
98 # The spreadsheet-model interface is available from the document object.
99 # The spreadsheet-view interface is available from the controller.
100 #
101 controller = document.getCurrentController()
102 sheets = document.getSheets()
103
104 # If the output file name contains a %d or %s format specifier, convert all sheets.
105 # Use the sheet number if the format is %d, otherwise the sheet name.
106 dfmt = re.search(r'%[0-9]*d', outputFile)
107 sfmt = re.search(r'%s', outputFile)
108
109 if dfmt or sfmt:
110 i = 0
111 while i < sheets.getCount():
112 # Activate the sheet.
113 sheet = sheets.getByIndex(i)
114 controller.setActiveSheet(sheet)
115
116 # Create output file name.
117 if dfmt:
118 ofile = outputFile % (i+1)
119 else:
120 ofile = outputFile % sheet.getName().replace(' ', '_')
121
122 if verbose: print " %s" % ofile
123
124 # Save the sheet to the output file.
125 outputUrl = uno.systemPathToFileUrl(os.path.abspath(ofile))
126 document.storeToURL(outputUrl, props)
127 i += 1
128
129 else:
130 # Activate the sheet to be converted.
131 if re.search(r'^\d+$', inputSheet):
132 sheet = sheets.getByIndex(int(inputSheet)-1)
133 else:
134 sheet = sheets.getByName(inputSheet)
135
136 controller.setActiveSheet(sheet)
137 outputUrl = uno.systemPathToFileUrl(os.path.abspath(outputFile))
138 document.storeToURL(outputUrl, props)
139 finally:
140 if document: document.close(True)
141
142
143 if __name__ == "__main__":
144 from sys import argv
145 from os.path import isfile
146
147 if len(argv) == 2 and argv[1] == '--shutdown':
148 ooutils.oo_shutdown_if_running()
149 else:
150 if len(argv) < 3 or len(argv) % 2 != 1:
151 print "USAGE:"
152 print " python %s INPUT-FILE[:SHEET] OUTPUT-FILE ..." % argv[0]
153 print "OR"
154 print " python %s --shutdown" % argv[0]
155 exit(255)
156
157 try:
158 i = 1
159 converter = SSConverter()
160
161 while i+1 < len(argv):
162 print '%s => %s' % (argv[i], argv[i+1])
163 converter.convert(argv[i], argv[i+1], True)
164 i += 2
165
166 except ErrorCodeIOException, exception:
167 print "ERROR! ErrorCodeIOException %d" % exception.ErrCode
168 exit(1)
| Attachment | Size |
|---|---|
| ssconverter.py_.txt | 6.26 KB |
| ooutils.py_.txt | 4.61 KB |
Mitch Frazier is an Associate Editor for Linux Journal.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Designing Electronics with Linux | May 22, 2013 |
| Dynamic DNS—an Object Lesson in Problem Solving | May 21, 2013 |
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
- RSS Feeds
- Dynamic DNS—an Object Lesson in Problem Solving
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Using Salt Stack and Vagrant for Drupal Development
- New Products
- A Topic for Discussion - Open Source Feature-Richness?
- Drupal Is a Framework: Why Everyone Needs to Understand This
- Validate an E-Mail Address with PHP, the Right Way
- What's the tweeting protocol?
- Tech Tip: Really Simple HTTP Server with Python
- Kernel Problem
1 hour 59 min ago - BASH script to log IPs on public web server
6 hours 26 min ago - DynDNS
10 hours 2 min ago - Reply to comment | Linux Journal
10 hours 34 min ago - All the articles you talked
12 hours 58 min ago - All the articles you talked
13 hours 1 min ago - All the articles you talked
13 hours 2 min ago - myip
17 hours 27 min ago - Keeping track of IP address
19 hours 18 min ago - Roll your own dynamic dns
1 day 32 min ago
Enter to Win an Adafruit Pi Cobbler Breakout Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Pi Cobbler Breakout Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- 5-21-13, Prototyping Pi Plate Kit: Philip Kirby
- Next winner announced on 5-27-13!
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?



Comments
Memory leak update
After investigating deeper, this script does shutdown OpenOffice. This works around the OpenOffice memory leak problems. Good.
The memory leak issue I saw appears if OpenOffice is already started BEFORE calling this script (headless instance started on the same port). Uno seems to be confused with what instance of OpenOffice to use and close. I ended up with hundreds of openoffice processes.
Not prestarting OpenOffice removed the confusion.
Cheers
Memory leak: known issue -> Restart OO
About my previous memory leak report.
It seems that this is a known issue. JODConverter does restart OO for each conversion as a workaround.
See http://www.openoffice.org/issues/show_bug.cgi?id=41675
Memory leak when running multiple times?
Thanks for showing us your code.
I tried to call your code on a few hundreds of spreasheets (1 call per 1 spreadsheet) and I noticed that the conversion (Excel to CSV) became slower after each call. In fact, something was eating more and more memory. My server eventually ran out of memory/swap and stopped.
With the calling code on my side, I had previously tried JODConvertor and Roo. I did not have such memory issues.
BTW, JODConverter worked fine but it does not support multiple sheets.
Roo has support for multiple sheets but it has two major issues for me. Limited support for Asian encoded characters. And silent crash for lines containings Merged cells (the lines are just skipped in the CSV output).
Problems with ooutils.py in Debian/Ubuntu
Hey,
I have a problem using this script in Ubuntu,
NameError: global name 'OPENOFFICE_BIN' is not defined
in order to solve it, edit ooutils.py, go to line 19 and add this path:
/usr/lib/openoffice/program/
Many thanks for share your code,
Regards
Convert SpreadSheets to CSV files.
Thanks for listening. I posted a request for this functionality some time ago. As I couldn't figure out how to do it, I resourced to xls2csv, a Perl module. As to date it has worked fine, specially since sometimes the files to convert are large, say 200+ cols and 1500+ rows. Only complaint is that the installation is not as easy as installing OpenOffice.org with a simple apt-get install.
I will give it a try early next week.
Regards