Reading Native Excel Files in Perl
In my last article for Linux Journal's web edition, I discussed a web-based program that queried an SQL database and output a native Excel file. That article was based on a program I wrote for a customer some time ago. Today, I'm going to write about doing the exact opposite; today's program will accept an Excel spreadsheet, collect data from it, and make additions to an SQL database. This program is also based on a program I'm writing for a customer. In fact, I wrote both programs for the same customer.
In the past, my customer would upload data into the database using a standard ASCII file. They worked with their data in Excel and would export it when it was time to put it into the SQL database. However, it seems that they invariably had difficulty remembering if their data was supposed to be exported as comma-delimited, or tab-delimited format. This fact alone made for some frustrating times. But it got worse.
Comma-delimited data is easy to read in a text editor, and not too difficult to manipulate programatically. This format works great with most types of data such as names and phone numbers. But it doesn't work well for all types of data. For example, comma-delimited format works great for storing information about my friend:
John,Q,Public,15055551234
It's easy to see that we've got a first name, middle initial, last name, and phone number. But, comma-delimited format doesn't do so well for storing information about my friend's son, John Q. Public, Jr:
John,Q,Public, Jr.,15055551234
So, it seems that perhaps tab-delimited format might be better:
John Q Public, Jr. 15055551234
Well, that LOOKS a lot better. However, there is a problem. When I typed this data in, I put a space at the end of the first name by mistake. Can you see it? Nope, neither can I, but it's there and it could cause a lot of problems for us down the road.
Finally, what if we had more than just a handful fields to deal with? Tab-delimited format would quickly become unwieldy to look at with an editor.
So it seems that keeping our data in native Excel format is a nice choice. It may seem that I'm exaggerating a bit with the discussion above, but my customer has run into each of these problems in the past and it's been
frustrating for them. So, we decided to use Excel for both output and input data formats. This works well for my customer since they are familiar with manipulating data in Excel and I'm able to do testing with spreadsheets created in OpenOffice.
The only remaining issue is making sure we all agree on which columns hold what data fields. Our program takes a few extra steps to make sure this problem goes away.
For this article, I've contrived a scenario that allows me to demonstrate many of the features of the Spreadsheet::ParseExcel Perl module.
In this example, we've got a bunch of people going around to schools and collecting data about the children. We're collecting full name and phone number. We're also interested in knowing if they have an A average or an F average. We also want to be able to keep the data separated by school. For those kids with an A average, we're changing the color of their name to green; for those kids with an F average, were setting their name to red. Also, we're creating a new worksheet for each school and naming the sheet after the school the child attends. Obviously, this method isn't very efficient, but like I said, this is a contrived example meant to demonstrate various parts of the module.
The point of this demonstration is to get the raw data out of the spreadsheet, get the names of each of the
worksheets in the spreadsheet, and gather information about the formatting in each cell. Once you see how all of this is accomplished, you'll be able to use the module to do whatever you might need to do.
But first, we need a database to store the results in. We're only needing one table to store our data in, and here it is:
drop table children;
create table children (
name varchar(50),
phone char(11),
school varchar(50),
code char(6)
);
I know, using a varchar(50) to store a kid's name and school is probably overkill. Also, you'll notice that I'm using a char(6) for the code field. The code field will store the color information from the name field, in RRGGBB hex format. Once the data is in the database, it wouldn't take too much effort to pretty things up a bit.
Now that we've got our database table defined, we need a web form so that we can upload our data. This simple
form will get the job done:
<html>
<head>
<title>Data Upload Page</title>
</head>
<body>
Upload Your Data
<form name=main method=POST action=/cgi-bin/load.pl ENCTYPE="multipart/form-data">
<input type=file name=file>
<input type=submit>
</form>
</body>
</html>
Now all we do is save this file to an html file in our web server's webspace. When our data collectors point their browsers at the resulting URL, they'll be able to upload their Excel data into our database.
What we need next is some data. Take a look at figures 1 and 2.


Well, everything looks ok... except that someone goofed. The columns for School A are in a different order than they are for School B. It will be easier for our program to deal with this problem than it would be to get our users to create their document consistently. From here, we can see that we've got data for two schools, School A, and School B. We can also see that one of our students is an A student and one is an F student. The rest are presumably somewhere in between.
Well, lets look at the program, finally.
Not that this program is Rocket Science, I'll break it down into pieces and discuss each in turn.
First the boilerplate:
==================================================
#!/usr/bin/perl
use DBI;
use CGI;
use Spreadsheet::ParseExcel;
$cgi = new CGI;
$dbh = DBI->connect("dbi:Pg:dbname=test;host=db.example.com", "postgres", "password") ||
die "Can't connect to database.\n";
$file = $cgi->param("file");
$workbook = Spreadsheet::ParseExcel::Workbook->Parse($cgi->param("file"));
==================================================
As you can see, this is pretty much the same as the program I discussed before. I'm not using the “-w” or the “use strict” features of perl because once this program works and produces correct output, I'm not interested in whether all of my variables are initialized and such; I'm interested in the results of the program.
Here we create a new CGI object as well as a DBI and Spreadsheet::ParseExcel object.
In this case, we pass a file handle from the CGI object to the Spreadsheet::ParseExcel constructor for parsing.
Going further:
==================================================
foreach $sheet (@{$workbook->{Worksheet}}) {
foreach $col ($sheet->{MinCol} .. $sheet->{MaxCol}) {
if ($sheet->{Cells}[0][$col]->{Val} eq "Name") {
$name = $col;
}
if ($sheet->{Cells}[0][$col]->{Val} eq "Phone") {
$phone = $col;
}
}
==================================================
In this block of code, we start a loop over each worksheet in the workbook. Then, for each worksheet, we determine which range of columns our data collectors used and start looking at our column headings. We're trying to find out which columns contain the Name and Phone fields. Once we know how our columns are organized, we're ready to begin parsing the data.
==================================================
foreach $row ($sheet->{MinRow}+1 .. $sheet->{MaxRow}) {
$child_name = $sheet->{Cells}[$row][$name]->{Val};
$child_phone = $sheet->{Cells}[$row][$phone]->{Val};
$child_school = $sheet->{Name};
$child_code = Spreadsheet::ParseExcel->ColorIdxToRGB(
$sheet->{Cells}[$row][$name]->{Format}->{Font}->{Color});
==================================================
In this block, we loop over each row, much like we looped over each column earlier. Only this time, we skip the row that contains the headings. Then we use the $sheet object to get the cell values for the child_name and child_phone fields. Then we grab the school name from the name of each worksheet.
Finally, we grab some formatting information from the name column.
==================================================
$dbh->do("insert into children (name,phone,school,code) values
(\'$child_name\', \'$child_phone\',
\'$child_school\', \'$child_code\')");
}
}
==================================================
Next, we insert the data into the database.
==================================================
print $cgi->header();
print <<EOF
<html>
<head>
<title>File Has Been Uploaded
</head>
<body>
Thank You.
</body>
</html>
EOF
;
exit;
==================================================
Finally, we tell the user that we've taken their uploaded information and inserted it into the database. Nothing really fancy here. In fact, in the real world, it would make sense to gather and display some sort of summary data indicating that ALL of the data was inserted and indicating if any errors were detected. But for our example, this will do just fine.
Now that the example data file has been uploaded, we can ask our database to show us what it has:
The query: select * from children;
results in:
name | phone | school | code ------------+-------------+----------+-------- John Smith | 5551234 | School A | 000000 Sam Adams | 5554321 | School A | 00FF00 Jane Doe | 5550000 | School A | FF0000 Mike D | 5552222 | School B | 000000 John A | 5559999 | School B | 000000 (5 rows)
And this is exactly what we expected.
So, by using the Spreadsheet::ParseExcel Perl module, we've eliminated the possibility of our users trying to import data in the wrong format and we've made it easy for them to manipulate their data in whatever program they feel most comfortable with, while still ensuring that we can eventually get the resulting data into our SQL database.
This was a sample program that solved a contrived problem, but I hope it demonstrated how easy it is to deal with Excel files in Perl.
Mike Diehl is a freelance Computer Nerd specializing in Linux administration, programing, and VoIP. Mike lives in Albuquerque, NM. with his wife and 3 sons. He can be reached at mdiehl@diehlnet.com
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.
Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.
Sponsored by ActiveState
| Speed Up Your Web Site with Varnish | Jun 19, 2013 |
| Non-Linux FOSS: libnotify, OS X Style | Jun 18, 2013 |
| Containers—Not Virtual Machines—Are the Future Cloud | Jun 17, 2013 |
| Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer | Jun 12, 2013 |
| Weechat, Irssi's Little Brother | Jun 11, 2013 |
| One Tail Just Isn't Enough | Jun 07, 2013 |
- Speed Up Your Web Site with Varnish
- Containers—Not Virtual Machines—Are the Future Cloud
- Non-Linux FOSS: libnotify, OS X Style
- Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer
- Linux Systems Administrator
- Senior Perl Developer
- Technical Support Rep
- UX Designer
- Android's Limits
- Weechat, Irssi's Little Brother
- Yeah, user namespaces are
40 min 40 sec ago - Cari Uang
4 hours 11 min ago - user namespaces
7 hours 5 min ago - yea
7 hours 31 min ago - One advantage with VMs
9 hours 59 min ago - about info
10 hours 32 min ago - info
10 hours 33 min ago - info
10 hours 34 min ago - info
10 hours 36 min ago - info
10 hours 38 min ago
Featured Jobs
| Linux Systems Administrator | Houston and Austin, Texas | Host Gator |
| Senior Perl Developer | Austin, Texas | Host Gator |
| Technical Support Rep | Houston and Austin, Texas | Host Gator |
| UX Designer | Austin, Texas | Host Gator |
| Web & UI Developer (JavaScript & j Query) | Austin, Texas | Host Gator |
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?



Comments
database
Hi. I am quite new to perl and SQL. what file name and extension should i use for the database before i perform the html and perl steps that you have described? thanks