In my work, I often find myself writing CGI programs that need to read or write information on the file system attached to a web server. Sometimes, this information is fairly simple, throwaway stuff, such as logging information accrued when I am trying to debug a particularly difficult program.
Sometimes, as we saw in a series of columns earlier this year, we can use text files for the storage and retrieval of structured information, such as the questions and answers for a multiple-choice quiz. Those quizzes were stored in a simple format, with each question placed on a line by itself. For example, here is a line that might have come from one of those quizzes:
What color was George Washington's white horse? White Black Gray Pink a
While the mechanics of magazine publishing mean that you cannot see the difference between various whitespace characters, the above line is separated into six fields: The question text, the four answers presented to the user, and a letter (a, b, c or d) indicating which of the four answers is correct. Fields are separated by Tab characters (ASCII 9), which look identical to space characters (ASCII 32), but which are quite different as far as the computer is concerned.
The quiz programs we explored earlier this year expected to read from files containing one or more such lines, with each line representing a single question. A quiz containing a single question (for users who prefer easy challenges) would have one line, while a quiz containing 1000 questions would contain 1000 lines.
This raises the important issue of scalability, the software's ability to remain efficient even when data sets become quite large. It is not difficult to write programs that can handle small amounts of data efficiently, particularly as hardware continues to drop in price while increasing in performance. It is much harder, though, to write software that can handle large amounts of data.
ASCII text files are wonderful when dealing with small amounts of data, since they are easy to manipulate from within programs, particularly when using Perl, which is strong in handling regular expressions. But when we must work with a large amount of data, or when we want to perform sophisticated searches, we may find ourselves reinventing the wheel or working with tools (such as ASCII text files) that no longer fit our needs.
A common solution to this problem is to off-load the data storage and retrieval to a program known as a relational database server. The “server” part of the name indicates that it expects to receive requests from one or more clients, and the “database” part of the name indicates its storage and retrieval of information on behalf of those clients. You may, however, be unfamiliar with the “relational” part of the name, which means that data is stored in sets of tables, which we can access using SQL, the Structured Query Language. This month, we take a first look at SQL queries, including how they can be integrated into our CGI programs; in the coming months, we will explore this topic in greater depth, using relational database servers for a variety of projects.
SQL is an international standard to which many corporate databases adhere. While the “L” in SQL stands for “language”, it does not mean that you can write programs in SQL. Rather, SQL is a language for formulating queries to database servers. The SQL commands must be incorporated into programs written in a true programming language, such as Perl or C.
Relational databases work on the client-server model, just as the Web does. Whereas web clients and servers communicate using HTTP, database clients and servers communicate with SQL. Needless to say, SQL is much more complicated than HTTP, although as you will see, it is fairly straightforward to learn. SQL may be easy to learn, but that does not mean it is simple. On the contrary, long-time database administrators and programmers understand more about the storage and retrieval of data using SQL than I could ever imagine.
The key to understanding SQL is to realize that everything in an SQL database is stored in a table. Rows in the table represent table records, while columns represent fields. Thus, we could represent an address book as a table.
Name Telephone ---- --------- Reuven 04-824-2265 Andy 02-123-4567 Gil 04-999-8888
There are three records in this table, each represented by a row. Each record contains two fields, each represented by a column. Each table and column must have a name, so we will call this the “phone_book” table, with two columns, “name” and “telephone”.
So far, this might not seem like a great advance over what we have done with text files. Why bother with rows, columns, and tables when we can use an ASCII file?
The simple answer is that we can allow the database server to do the work for us—and it will return an answer to us very quickly, as per our instructions, without getting bogged down by the number of records in the database. If we were interested in finding Andy's telephone number with a text file version of the above table, we would need to iterate through the entire file, checking each record for a match. With a relational database, we can issue an SQL query to the database server, asking for only those rows which match our particular criteria.
Thus, if we were interested in retrieving Andy's telephone number from the table above, we could use an SQL select command to do so:
select telephone from phone_book where name = "Andy";
The SQL statement above asks the database server to return the telephone column from the table named phone_book, for each row in which the name is Andy. If a single row matches the query, we receive a single row as a response from the database server, but if multiple rows match, we receive all of those rows. If no row matches our query, we receive no rows, which might seem odd, until you realize that database client programs often iterate over the results returned to them. Iterating over no values is as easy as iterating over 100 values, although most good client programs check to make sure that at least one row was returned.
We can insert a row into our table with the following:
insert into phone_book (name,telephone) values ("Iris","04-999-8888");
After performing the operation above, our table looks like:
Name Telephone---- ---------Reuven 04-824-2265Andy 02-123-4567Gil 04-999-8888Iris 04-999-8888
which we can see by retrieving everything, using an asterisk to mean “all columns”:
select * from phone_book;
If we want to retrieve all of the rows belonging to people with the telephone number 04-999-8888, we use this line:
select name from phone_book where telephone = "04-999-8888";Note that we do not need to worry about two identical records, since relational databases strictly require each record to be unique in some way. Two rows might differ in only a single column, but that column is enough to make the rows distinct.
One advantage, then, of using SQL and a relational database server is the increased efficiency, both of our programs (which no longer need to read the entire contents of a text file) and ourselves (since we no longer need to write matching engines and define data formats). There are other advantages to using SQL and relational databases too; most importantly, database servers handle file locking in a sophisticated and efficient way, ensuring that data is not lost while keeping operations moving along quickly.
Relational databases also offer an amazing array of optimization techniques and security levels, among other things. And best of all, SQL is a portable standard that (mostly) works in the same way on a great many database systems; that is, once you learn how to write some basic SQL queries, you will be able to store your data on just about any available platform.
Most of my SQL experience is with Sybase on Solaris systems, but for the purposes of this article, I decided the time had come to install a relational database server on my Linux machine (running Red Hat 4.0 with a number of updated packages, including the 2.0.30 kernel). I decided to download MySQL, a database server that looked small but powerful, and which came in RPM (Red Hat Package Manager) format, allowing me to install it quickly. (Don't confuse MySQL with mSQL, another relational database package available for Linux. For information on how to obtain MySQL, see the sidebar accompanying this article.)