Designing Databases

Structuring tables can improve database performance-- here's how to do it.

Relational databases are becoming increasingly popular for web applications. This is generally a good thing, allowing us to focus on the way in which our data is structured, rather than the way it is stored on disk. Offloading data storage and retrieval tasks to a relational database server means our programs can be smaller and easier to maintain.

However, incorporating a database server on a web site is not a cure-all. The database might take care of many necessary tasks, but it cannot design your tables for you, nor determine the best way in which to work with them.

This month, we will look at the art of database design and how we can structure tables to improve performance. Getting the most out of a database is something of a black art, which is why good database administrators (DBAs) are always in high demand. But with a few simple techniques, we can overcome some of the most basic performance problems experienced by web programmers. We will design a database that can handle train schedules. In so doing, we will touch on a number of issues database programmers face when deciding how to design tables in a database.

Trains and Tables

I love trains, and often take the train between Tel Aviv and Haifa when I must travel between those two cities. One day, after leafing through the small paper schedule that Israel's Rail Authority distributes, I realized the implementation of a computerized train schedule is not as obvious as it would appear at first.

Rail schedules often come in the form of a printed table, with the columns representing individual trains and each station in its own row. Each table lists trains on one rail line, in a single direction.

Since relational databases store all of their data in tables, you might think this is a perfect way in which to store the information. In order to allow us to add and delete trains more easily, we will swap the axes from the printed schedule, putting the individual trains in the rows and the stations in the columns.

To define such a table in SQL, we could use a query like this:

CREATE TABLE HaifaToTelAviv (
   haifa_central   TIME NOT NULL,
   haifa_bat_galim TIME NOT NULL,
   binyamina       TIME NOT NULL,
   hof_hacarmel    TIME NOT NULL,
   ta_central      TIME NOT NULL,
   ta_hashalom     TIME NOT NULL
);

Given such a table, we could enter our trains as follows:

INSERT INTO HaifaToTelAviv
   (haifa_central, haifa_bat_galim, binyamina, hof_hacarmel,
     ta_central, ta_hashalom)
VALUES
   ("12:05", "12:10", "12:17", "12:37", "13:16",
    "13:21");
If you have any experience with databases, you can quickly see the terrible problems in store for us here. For starters, what happens if a new station is built between Haifa and Tel Aviv? That would require us to redefine our table, adding a new column, and that's only the beginning. It is a bit absurd that each train line requires two tables, one for each direction. And there isn't any way for me to determine whether a particular rail line serves any two cities—if the cities are represented by columns. What can we do about Tel Aviv? If two cities are close to each other and I can take a train to either one, I will have to query two tables in order to find the answer.

In addition, trying to query information from the above HaifaToTelAviv table would be difficult, requiring us to know the name of the column corresponding to each station. The problems just continue from there—for instance, what do we enter if the express train passes Binyamina? We could define the “binyamina” column to be NULL and enter a NULL value in that column. However, NULL normally indicates that a value is unknown or missing, whereas the reason in this case is much simpler.

Finally, what happens if a new schedule comes out, making each train later by a different amount of time? Editing the schedule in this format would be quite difficult.

Designing for Flexibility

How should we model the train schedule, then, if we cannot do so from the printed schedule? The solution is to break the information into smaller tables, bringing them together to answer questions. Relational databases specialize in this sort of operation, allowing us to “join” two or more tables together.

Breaking the single large table into many smaller tables makes the database more flexible, allowing us to ask many more questions than would otherwise be possible. For example, we should be able to ask questions like:

  • What is the last train from Haifa that will arrive in Tel Aviv before 11:00 a.m.?

  • Are there any express trains from Tel Aviv to Haifa?

  • What time will the 10:00 a.m. train from Binyamina get to Tel Aviv?

If we model our data correctly, breaking it down into sufficiently small and flexible tables, it should be possible to answer any of these questions with a single SQL query.

These examples all use MySQL, a “mostly free” database popular with many web sites. MySQL lacks some of the advanced features of other databases, such as transactions and referential integrity. However, it is easy to install and administer and is extremely fast. You can learn more about MySQL at http://www.mysql.com/.

For example, here is a definition of the RailStations table:

CREATE TABLE RailStations (
   id TINYINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
   name VARCHAR(50) NOT NULL,
   UNIQUE(name)
);

The only reason for RailStations to exist is to associate a numeric ID with each station. It might seem silly to create such a table, when we could enter station names directly wherever we need them.

However, giving each station an ID number gives us two advantages. First of all, we can be sure the station names will be spelled consistently, without variations in spelling, capitalization and abbreviations. Second, an integer consumes less space than the name to which it points. Each tinyint consumes a single byte, whereas a 20-character station name will consume 20 bytes. Referring to the full name would thus consume 20 times as much RAM and disk space.

Notice that we define id to be a column of type TINYINT UNSIGNED. This allows us to assign values between 0 and 255. Large rail systems, with more than 255 stations, would need to use a SMALLINT UNSIGNED, which ranges between 0 and 65535.

We ensure each station name in RailStations is unique by giving it the UNIQUE qualifier. The ID numbers are already guaranteed to be unique because they have been declared the primary key. Better yet, because we specified AUTO_INCREMENT, MySQL will automatically assign an ID number if an INSERT query ignores it. For example:

INSERT INTO RailStations (name)
VALUES ("Nahariya");

If we now query the database:

SELECT id
FROM RailStations
WHERE name = "Nahariya";
we learn that Nahariya has been automatically assigned an ID of 1.

We can insert one or more new rows into the table with a single INSERT statement. For example, the following adds several more rows to RailStations:

INSERT INTO RailStations (name)
VALUES ("Akko"),
       ("Hof Hacarmel"),
       ("Tel Aviv Central"),
       ("Tel Aviv Hashalom"),
       ("Lod"),
       ("Rehovot"),
       ("Herzliya")
;
______________________

White Paper
Linux Management with Red Hat Satellite: Measuring Business Impact and ROI

Linux has become a key foundation for supporting today's rapidly growing IT environments. Linux is being used to deploy business applications and databases, trading on its reputation as a low-cost operating environment. For many IT organizations, Linux is a mainstay for deploying Web servers and has evolved from handling basic file, print, and utility workloads to running mission-critical applications and databases, physically, virtually, and in the cloud. As Linux grows in importance in terms of value to the business, managing Linux environments to high standards of service quality — availability, security, and performance — becomes an essential requirement for business success.

Learn More

Sponsored by Red Hat

White Paper
Private PaaS for the Agile Enterprise

If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.

Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.

Learn More

Sponsored by ActiveState