Reading Native Excel Files in Perl

In my last article for Linux Journal's web edition, I discussed a web-based program that queried an SQL database and output a native Excel file. That article was based on a program I wrote for a customer some time ago. Today, I'm going to write about doing the exact opposite; today's program will accept an Excel spreadsheet, collect data from it, and make additions to an SQL database. This program is also based on a program I'm writing for a customer. In fact, I wrote both programs for the same customer.

In the past, my customer would upload data into the database using a standard ASCII file. They worked with their data in Excel and would export it when it was time to put it into the SQL database. However, it seems that they invariably had difficulty remembering if their data was supposed to be exported as comma-delimited, or tab-delimited format. This fact alone made for some frustrating times. But it got worse.

Comma-delimited data is easy to read in a text editor, and not too difficult to manipulate programatically. This format works great with most types of data such as names and phone numbers. But it doesn't work well for all types of data. For example, comma-delimited format works great for storing information about my friend:

John,Q,Public,15055551234

It's easy to see that we've got a first name, middle initial, last name, and phone number. But, comma-delimited format doesn't do so well for storing information about my friend's son, John Q. Public, Jr:

John,Q,Public, Jr.,15055551234

So, it seems that perhaps tab-delimited format might be better:

John Q Public, Jr. 15055551234

Well, that LOOKS a lot better. However, there is a problem. When I typed this data in, I put a space at the end of the first name by mistake. Can you see it? Nope, neither can I, but it's there and it could cause a lot of problems for us down the road.

Finally, what if we had more than just a handful fields to deal with? Tab-delimited format would quickly become unwieldy to look at with an editor.

So it seems that keeping our data in native Excel format is a nice choice. It may seem that I'm exaggerating a bit with the discussion above, but my customer has run into each of these problems in the past and it's been
frustrating for them. So, we decided to use Excel for both output and input data formats. This works well for my customer since they are familiar with manipulating data in Excel and I'm able to do testing with spreadsheets created in OpenOffice.

The only remaining issue is making sure we all agree on which columns hold what data fields. Our program takes a few extra steps to make sure this problem goes away.

For this article, I've contrived a scenario that allows me to demonstrate many of the features of the Spreadsheet::ParseExcel Perl module.

In this example, we've got a bunch of people going around to schools and collecting data about the children. We're collecting full name and phone number. We're also interested in knowing if they have an A average or an F average. We also want to be able to keep the data separated by school. For those kids with an A average, we're changing the color of their name to green; for those kids with an F average, were setting their name to red. Also, we're creating a new worksheet for each school and naming the sheet after the school the child attends. Obviously, this method isn't very efficient, but like I said, this is a contrived example meant to demonstrate various parts of the module.

The point of this demonstration is to get the raw data out of the spreadsheet, get the names of each of the
worksheets in the spreadsheet, and gather information about the formatting in each cell. Once you see how all of this is accomplished, you'll be able to use the module to do whatever you might need to do.

But first, we need a database to store the results in. We're only needing one table to store our data in, and here it is:

drop table children;

create table children (
       name varchar(50),
       phone char(11),
       school varchar(50),
       code char(6)
);

I know, using a varchar(50) to store a kid's name and school is probably overkill. Also, you'll notice that I'm using a char(6) for the code field. The code field will store the color information from the name field, in RRGGBB hex format. Once the data is in the database, it wouldn't take too much effort to pretty things up a bit.

Now that we've got our database table defined, we need a web form so that we can upload our data. This simple
form will get the job done:

<html>
<head>
<title>Data Upload Page</title>
</head>
<body>
Upload Your Data
<form name=main method=POST action=/cgi-bin/load.pl ENCTYPE="multipart/form-data">
<input type=file name=file>
<input type=submit>
</form>
</body>
</html>

Now all we do is save this file to an html file in our web server's webspace. When our data collectors point their browsers at the resulting URL, they'll be able to upload their Excel data into our database.

What we need next is some data. Take a look at figures 1 and 2.

Well, everything looks ok... except that someone goofed. The columns for School A are in a different order than they are for School B. It will be easier for our program to deal with this problem than it would be to get our users to create their document consistently. From here, we can see that we've got data for two schools, School A, and School B. We can also see that one of our students is an A student and one is an F student. The rest are presumably somewhere in between.

Well, lets look at the program, finally.

Not that this program is Rocket Science, I'll break it down into pieces and discuss each in turn.

First the boilerplate:

==================================================
#!/usr/bin/perl

use DBI;
use CGI;
use Spreadsheet::ParseExcel;

$cgi = new CGI;

$dbh = DBI->connect("dbi:Pg:dbname=test;host=db.example.com", "postgres", "password") ||
die "Can't connect to database.\n";

$file = $cgi->param("file");

$workbook = Spreadsheet::ParseExcel::Workbook->Parse($cgi->param("file"));
==================================================

As you can see, this is pretty much the same as the program I discussed before. I'm not using the “-w” or the “use strict” features of perl because once this program works and produces correct output, I'm not interested in whether all of my variables are initialized and such; I'm interested in the results of the program.

Here we create a new CGI object as well as a DBI and Spreadsheet::ParseExcel object.

In this case, we pass a file handle from the CGI object to the Spreadsheet::ParseExcel constructor for parsing.

Going further:

==================================================
foreach $sheet (@{$workbook->{Worksheet}}) {

       foreach $col ($sheet->{MinCol} .. $sheet->{MaxCol}) {
               if ($sheet->{Cells}[0][$col]->{Val} eq "Name") {
                       $name = $col;
               }

               if ($sheet->{Cells}[0][$col]->{Val} eq "Phone") {
                       $phone = $col;
               }
       }
==================================================

In this block of code, we start a loop over each worksheet in the workbook. Then, for each worksheet, we determine which range of columns our data collectors used and start looking at our column headings. We're trying to find out which columns contain the Name and Phone fields. Once we know how our columns are organized, we're ready to begin parsing the data.

==================================================
       foreach $row ($sheet->{MinRow}+1 .. $sheet->{MaxRow}) {

               $child_name = $sheet->{Cells}[$row][$name]->{Val};
               $child_phone = $sheet->{Cells}[$row][$phone]->{Val};

               $child_school = $sheet->{Name};

               $child_code = Spreadsheet::ParseExcel->ColorIdxToRGB(
                       $sheet->{Cells}[$row][$name]->{Format}->{Font}->{Color});

==================================================

In this block, we loop over each row, much like we looped over each column earlier. Only this time, we skip the row that contains the headings. Then we use the $sheet object to get the cell values for the child_name and child_phone fields. Then we grab the school name from the name of each worksheet.

Finally, we grab some formatting information from the name column.

==================================================
               $dbh->do("insert into children (name,phone,school,code) values
                       (\'$child_name\', \'$child_phone\',
                       \'$child_school\', \'$child_code\')");
       }
}
==================================================

Next, we insert the data into the database.


==================================================
print $cgi->header();

print <<EOF
<html>
<head>
<title>File Has Been Uploaded>
</head>
<body>
Thank You.
</body>
</html>
EOF
;

exit;
==================================================

Finally, we tell the user that we've taken their uploaded information and inserted it into the database. Nothing really fancy here. In fact, in the real world, it would make sense to gather and display some sort of summary data indicating that ALL of the data was inserted and indicating if any errors were detected. But for our example, this will do just fine.

Now that the example data file has been uploaded, we can ask our database to show us what it has:

The query: select * from children;

results in:

   name    |    phone    |  school  |  code
------------+-------------+----------+--------
 John Smith | 5551234     | School A | 000000
 Sam Adams  | 5554321     | School A | 00FF00
 Jane Doe   | 5550000     | School A | FF0000
 Mike D     | 5552222     | School B | 000000
 John A     | 5559999     | School B | 000000
(5 rows)

And this is exactly what we expected.

So, by using the Spreadsheet::ParseExcel Perl module, we've eliminated the possibility of our users trying to import data in the wrong format and we've made it easy for them to manipulate their data in whatever program they feel most comfortable with, while still ensuring that we can eventually get the resulting data into our SQL database.

This was a sample program that solved a contrived problem, but I hope it demonstrated how easy it is to deal with Excel files in Perl.

Load Disqus comments