Convert SpreadSheets to CSV files with Python and pyuno, Part 1v2
Some months back I developed some pyuno code for converting spreadsheets into CSV files from the command line. Pyuno being the Python interface to the OpenOffice runtime. One of the enhancement suggestions I got was to add the ability to extract all the sheets and/or a specific sheet rather than always extracting the first sheet. The following update to the code does just that.
By way of refreshing our memory, the converter program takes pairs of input and output files, for example:
$ python ssconverter.py file1.xls file1.csv file2.ods file2.csv
Each input file is a spreadsheet and it is converted into the corresponding output file as a CSV file. The new version has been enhanced so that you can append a sheet name or number to the end of the input file name to specify a particular sheet to extract, for example:
$ python ssconverter.py file1.xls:1 file1.csv $ python ssconverter.py file1.xls:Sheet1 file1.csv $ python ssconverter.py file2.ods@1 file2.csv $ python ssconverter.py file2.ods@Sheet2 file2.csv
The sheet name or number (starting at 1) is appended to the input file name separated by a colon (:) or an at sign (@).
In addition, you can convert all the sheets in one pass by specifing a %d or a %s format specifier in the output file name:
$ python ssconverter.py file1.xls file1-%d.csv $ python ssconverter.py file1.xls file1-%s.csv
If the format specifier is %d, then the sheet number is formatted into the output file name, if %s is specified then the sheet name is used. The %d specifier can include zero pad and width specifiers (eg %04d).
The new code has three main changes over the the original code. The first change is to check for an input file sheet specification:
# Check for sheet specification in input file name. match = re.search(r'^(.*)[@:](.*)$', inputFile) if os.path.exists(inputFile) or not match: inputUrl = uno.systemPathToFileUrl(os.path.abspath(inputFile)) inputSheet = '1' # Convert fist sheet. else: inputUrl = uno.systemPathToFileUrl(os.path.abspath(match.group(1))) inputSheet = match.group(2)
The regular expression search checks to see if the input file name can be broken into a file name part and a sheet name part. If it can, and only if the original unbroken input file name does not refer to an existing file, the pieces are used as the input file and sheet specification. Checking to see if the original unbroken input file name refers to an existing file allows you to specify input files that have colons and at signs in their names, and it continues to work even in the case where you have an input file with a colon or at sign in its name and you want to extract a particular sheet, for example:
$ python ssconverter.py this:month.xls:sales output.csv
will correctly recognize that the sheet name is "sales" and that the file name is "this:month.xls". Since the regular expression search is be default greedy the break will occur at the last colon and since "this:month.xls:sales" does not refer to an existing file the match will be used.
The second main change, is related to how the document is loaded. After getting the rest of the code to the point where I thought it ought to be working I discovered while testing that OpenOffice was always converting the first sheet. After some flailing at the code failed to fix the problem I started looking at a few other examples of pyuno code and the only thing I really noted was that many did not use the Hidden property when loading the document. So, I gave that a shot and sure enough it fixed the problem. The accompanying comment has a few more notes:
# Sheet activation does not work properly when Hidden is specified. # Although the sheet does become the active sheet, it's not the sheet that # gets saved if the spreadsheet is loaded with Hidden=True. # # Removing Hidden=True doesn't seem to change anything: nothing appears # on the screen regardless of the Hidden value. # # document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, ooutils.oo_properties(Hidden=True)) document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, ooutils.oo_properties())
The third change is the one where the sheet or sheets are extracted. If a particular sheet is specified the sheet is activated first then saved to the output file:
# Activate the sheet to be converted. if re.search(r'^\d+$', inputSheet): sheet = sheets.getByIndex(int(inputSheet)-1) else: sheet = sheets.getByName(inputSheet) controller.setActiveSheet(sheet) outputUrl = uno.systemPathToFileUrl(os.path.abspath(outputFile)) document.storeToURL(outputUrl, props)
The regular expression here checks to see if the sheet name is a number or a string (sheet number or sheet name).
If all the sheets are being saved then the sheets are activated one by one, the output file name is formatted and the sheet is saved:
# Use the sheet number if the format is %d, otherwise the sheet name. dfmt = re.search(r'%[0-9]*d', outputFile) sfmt = re.search(r'%s', outputFile) if dfmt or sfmt: i = 0 while i < sheets.getCount(): # Activate the sheet. sheet = sheets.getByIndex(i) controller.setActiveSheet(sheet) # Create output file name. if dfmt: ofile = outputFile % (i+1) else: ofile = outputFile % sheet.getName().replace(' ', '_') if verbose: print " %s" % ofile # Save the sheet to the output file. outputUrl = uno.systemPathToFileUrl(os.path.abspath(ofile)) document.storeToURL(outputUrl, props) i += 1
The code in its entirety appears below:
1 #!/usr/bin/python 2 # 3 # Convert spreadsheet to CSV file. 4 # 5 # Based on: 6 # PyODConverter (Python OpenDocument Converter) v1.0.0 - 2008-05-05 7 # Copyright (C) 2008 Mirko Nasato <email@example.com> 8 # Licensed under the GNU LGPL v2.1 - or any later version. 9 # http://www.gnu.org/licenses/lgpl-2.1.html 10 # 11 12 import os 13 import re 14 import ooutils 15 16 import uno 17 from com.sun.star.task import ErrorCodeIOException 18 19 20 21 class SSConverter: 22 """ 23 Spreadsheet converter class. 24 Converts spreadsheets to CSV files. 25 """ 26 27 def __init__(self, oorunner=None): 28 self.desktop = None 29 self.oorunner = None 30 31 32 def convert(self, inputFile, outputFile, verbose=False): 33 """ 34 Convert the input file (a spreadsheet) to a CSV file. 35 36 The input file name can contain a sheet specification to specify a particular sheet. 37 The sheet specification is either a number or a sheet name. 38 The sheet specification is appended to the file name separated by a colon 39 or an at sign: ":" or "@". 40 41 If the output file name contains a %d or %s format specifier, then all the sheets 42 in the input file are converted, otherwise only the first sheet is converted. 43 44 If the output file name contains a %d format specifier then the sheet number 45 is used when formatting the output file name. 46 The format can contain a width specifier (eg %02d). 47 48 If the output file name contains a %s specifier then the sheet name is used 49 when formatting the output file name. 50 """ 51 52 # Start openoffice if needed. 53 if not self.desktop: 54 if not self.oorunner: 55 self.oorunner = ooutils.OORunner() 56 57 self.desktop = self.oorunner.connect() 58 59 # Check for sheet specification in input file name. 60 match = re.search(r'^(.*)[@:](.*)$', inputFile) 61 if os.path.exists(inputFile) or not match: 62 inputUrl = uno.systemPathToFileUrl(os.path.abspath(inputFile)) 63 inputSheet = '1' # Convert fist sheet. 64 else: 65 inputUrl = uno.systemPathToFileUrl(os.path.abspath(match.group(1))) 66 inputSheet = match.group(2) 67 68 69 # NOTE: 70 # Sheet activation does not work properly when Hidden is specified. 71 # Although the sheet does become the active sheet, it's not the sheet that 72 # gets saved if the spreadsheet is loaded with Hidden=True. 73 # 74 # Removing Hidden=True doesn't seem to change anything: nothing appears 75 # on the screen regardless of the Hidden value. 76 # 77 # document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, ooutils.oo_properties(Hidden=True)) 78 document = self.desktop.loadComponentFromURL(inputUrl, "_blank", 0, ooutils.oo_properties()) 79 80 try: 81 props = ooutils.oo_properties(FilterName="Text - txt - csv (StarCalc)") 82 # 83 # Another useful property option: 84 # FilterOptions="59,34,0,1" 85 # 59 - Field separator (semicolon), this is the ascii value. 86 # 34 - Text delimiter (double quote), this is the ascii value. 87 # 0 - Character set (system). 88 # 1 - First line number to export. 89 # 90 # For more information see: 91 # http://wiki.services.openoffice.org/wiki/Documentation/DevGuide/Spreadsheets/Filter_Options 92 93 # To convert a particular sheet, the sheet needs to be active. 94 # To activate a sheet we need the spreadsheet-view, to get the spreadsheet-view 95 # we need the spreadsheet-controller, to get the spreadsheet-controller 96 # we need the spreadsheet-model. 97 # 98 # The spreadsheet-model interface is available from the document object. 99 # The spreadsheet-view interface is available from the controller. 100 # 101 controller = document.getCurrentController() 102 sheets = document.getSheets() 103 104 # If the output file name contains a %d or %s format specifier, convert all sheets. 105 # Use the sheet number if the format is %d, otherwise the sheet name. 106 dfmt = re.search(r'%[0-9]*d', outputFile) 107 sfmt = re.search(r'%s', outputFile) 108 109 if dfmt or sfmt: 110 i = 0 111 while i < sheets.getCount(): 112 # Activate the sheet. 113 sheet = sheets.getByIndex(i) 114 controller.setActiveSheet(sheet) 115 116 # Create output file name. 117 if dfmt: 118 ofile = outputFile % (i+1) 119 else: 120 ofile = outputFile % sheet.getName().replace(' ', '_') 121 122 if verbose: print " %s" % ofile 123 124 # Save the sheet to the output file. 125 outputUrl = uno.systemPathToFileUrl(os.path.abspath(ofile)) 126 document.storeToURL(outputUrl, props) 127 i += 1 128 129 else: 130 # Activate the sheet to be converted. 131 if re.search(r'^\d+$', inputSheet): 132 sheet = sheets.getByIndex(int(inputSheet)-1) 133 else: 134 sheet = sheets.getByName(inputSheet) 135 136 controller.setActiveSheet(sheet) 137 outputUrl = uno.systemPathToFileUrl(os.path.abspath(outputFile)) 138 document.storeToURL(outputUrl, props) 139 finally: 140 if document: document.close(True) 141 142 143 if __name__ == "__main__": 144 from sys import argv 145 from os.path import isfile 146 147 if len(argv) == 2 and argv == '--shutdown': 148 ooutils.oo_shutdown_if_running() 149 else: 150 if len(argv) < 3 or len(argv) % 2 != 1: 151 print "USAGE:" 152 print " python %s INPUT-FILE[:SHEET] OUTPUT-FILE ..." % argv 153 print "OR" 154 print " python %s --shutdown" % argv 155 exit(255) 156 157 try: 158 i = 1 159 converter = SSConverter() 160 161 while i+1 < len(argv): 162 print '%s => %s' % (argv[i], argv[i+1]) 163 converter.convert(argv[i], argv[i+1], True) 164 i += 2 165 166 except ErrorCodeIOException, exception: 167 print "ERROR! ErrorCodeIOException %d" % exception.ErrCode 168 exit(1)
Mitch Frazier is an Associate Editor for Linux Journal.
Practical Task Scheduling Deployment
July 20, 2016 12:00 pm CDT
One of the best things about the UNIX environment (aside from being stable and efficient) is the vast array of software tools available to help you do your job. Traditionally, a UNIX tool does only one thing, but does that one thing very well. For example, grep is very easy to use and can search vast amounts of data quickly. The find tool can find a particular file or files based on all kinds of criteria. It's pretty easy to string these tools together to build even more powerful tools, such as a tool that finds all of the .log files in the /home directory and searches each one for a particular entry. This erector-set mentality allows UNIX system administrators to seem to always have the right tool for the job.
Cron traditionally has been considered another such a tool for job scheduling, but is it enough? This webinar considers that very question. The first part builds on a previous Geek Guide, Beyond Cron, and briefly describes how to know when it might be time to consider upgrading your job scheduling infrastructure. The second part presents an actual planning and implementation framework.
Join Linux Journal's Mike Diehl and Pat Cameron of Help Systems.
Free to Linux Journal readers.Register Now!
- Stunnel Security for Oracle
- SourceClear Open
- Murat Yener and Onur Dundar's Expert Android Studio (Wrox)
- SUSE LLC's SUSE Manager
- My +1 Sword of Productivity
- Managing Linux Using Puppet
- Google's SwiftShader Released
- Non-Linux FOSS: Caffeine!
- Parsing an RSS News Feed with a Bash Script
- Doing for User Space What We Did for Kernel Space
With all the industry talk about the benefits of Linux on Power and all the performance advantages offered by its open architecture, you may be considering a move in that direction. If you are thinking about analytics, big data and cloud computing, you would be right to evaluate Power. The idea of using commodity x86 hardware and replacing it every three years is an outdated cost model. It doesn’t consider the total cost of ownership, and it doesn’t consider the advantage of real processing power, high-availability and multithreading like a demon.
This ebook takes a look at some of the practical applications of the Linux on Power platform and ways you might bring all the performance power of this open architecture to bear for your organization. There are no smoke and mirrors here—just hard, cold, empirical evidence provided by independent sources. I also consider some innovative ways Linux on Power will be used in the future.Get the Guide