Designing Databases

Structuring tables can improve database performance-- here's how to do it.
Indexing

In addition to the column definitions and the UNIQUE qualifier, our definition for the StationLines table included three INDEX lines—one for each of the station_id, line_id and north_to_south columns.

While it often helps to think of a relational database table as a glorified spreadsheet with rows and columns, there are some important differences. One is that a database table does not store its rows in any particular order. If we are interested in retrieving rows from the table in a certain order, we must specify it with the ORDER BY clause in our query.

Because rows are not ordered in any particular way, a SELECT query can often take quite a while to fulfill. For example, take the following query:

SELECT id
FROM RailStations
WHERE name = "Tel Aviv Central";

This might not seem like a time-consuming query, given that it involves a single table and a simple WHERE clause. But since the rows of RailStations are not stored in any particular order, finding the rows where the name is “Tel Aviv Central” can take quite a while. This might be a negligible amount of time in the case of a 100-row table, but when a table contains 1,000 or 10,000 rows, the time can become noticeable. In this particular example, the database server is probably smart enough to realize that RailStations.name has been declared UNIQUE, meaning our query will return one row, if it returns anything. This means the server will, on average, have to search through only half of the rows—but that can still take quite a while.

An index changes this picture by adding a pointer to each column value. If RailStations.name is indexed, the MySQL server can almost immediately find those rows containing a particular value. It can also determine whether a value exists at all.

If indexes can increase query speeds so dramatically, why are rows unindexed by default? The main answer is that indexes are written and updated each time an INSERT or UPDATE operation is performed on a table. Since the majority of database queries are SELECTs, in which the index can substantially improve performance, this is normally an acceptable trade-off. However, certain applications must INSERT and UPDATE at maximum speed, in which case creating an index can cause problems.

Since indexes are used in locating columns of a certain value, they are necessary for only those columns that will be named in WHERE clauses. There is no need to index a column that is displayed, but rarely used as a search criterion.

In some cases, it is enough to index the first part of each column rather than the entire column. For example, if we are indexing a column of type VARCHAR(50), then we might be able to index only 10 of those characters. This will retain most of the advantages of a full index (since the first ten characters are rarely identical in such a text field), while reducing the amount of information the index must store.

Entering Train Information

Now that we have thoroughly examined the tables describing the train system, it is time to put some trains on those tracks. The question of how to model this data is a tough one, since there are a number of ways in which to accomplish it. I decided to split this information into two tables, Trains and DepartureTimes.

Each row of Trains describes a particular train, indicating the line on which it runs, the ID numbers of its origin and destination stations, and the time it departs from its origin:

CREATE TABLE Trains (
  id SMALLINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  line_id TINYINT UNSIGNED NOT NULL,
  origin_id TINYINT UNSIGNED NOT NULL,
  destination_id TINYINT UNSIGNED NOT NULL,
  depart_origin_time TIME NOT NULL,
  UNIQUE(line_id, origin_id, destination_id,
    depart_origin_time),
  INDEX(line_id),
  INDEX(origin_id),
  INDEX(destination_id),
  INDEX(depart_origin_time)
);

The first column is a primary key, allowing us to describe each train with a single number. The combination of a rail line, origin, destination and hour should be unique, so we ask the database server to enforce this condition with the UNIQUE keyword.

Finally, we define the DepartureTimes table, which stores information on when a train will leave from a particular station:

CREATE TABLE DepartureTimes (
   train_id SMALLINT UNSIGNED NOT NULL,
   station_id TINYINT UNSIGNED NOT NULL,
   departure_time TIME NOT NULL,
   INDEX(train_id),
   INDEX(station_id),
   INDEX(departure_time)
);

Once we enter information into these tables, we can start to perform sophisticated queries. For example, which trains arrive at “Tel Aviv Central” before 8 a.m.?

SELECT train_id
FROM DepartureTimes
WHERE departure_time < "08:00"
  AND station_id = 4;
Sure enough, this query returns a table containing two rows:
+----------+
| train_id |
+----------+
|        1 |
|        2 |
+----------+
Now we know two trains will arrive in Tel Aviv early enough for us to catch a morning meeting. But which trains are those? It would be nice to get more information than that. One possibility is to print the name of the origin station and the hour at which the train leaves:
SELECT S.name, T.depart_origin_time
FROM DepartureTimes DT, Trains T, RailStations S
WHERE DT.departure_time < "08:00"
  AND DT.station_id = 4
  AND DT.train_id = T.id
  AND S.id = T.origin_id;
Notice how SQL allows us to use < and > when handling dates and times, for columns declared as DATE, TIME or DATETIME. Given the contortions one must use in order to compare dates and times in nearly any programming language, this built-in date comparison is still one of my favorites.

Assuming we want to take the first train of the day (ID 1), we can print the list of when it will arrive at each station:

SELECT T.id, S.name, DT.departure_time
FROM RailStations S, DepartureTimes DT, Trains T,
  StationLines SL
WHERE T.id = DT.train_id
  AND T.id = 1 AND T.line_id = SL.line_id
  AND SL.station_id = DT.station_id
  AND DT.station_id = S.id
ORDER BY T.id, SL.north_to_south
;

We can even print a full schedule for trains to Tel Aviv (ID 5):

SELECT T.id, S.name, DT.departure_time
FROM RailStations S, DepartureTimes DT, Trains T,
  StationLines SL
WHERE T.id = DT.train_id
  AND T.line_id = SL.line_id
  AND SL.station_id = DT.station_id
  AND DT.station_id = S.id
  AND T.destination_id = 5
ORDER BY T.id, SL.north_to_south
;
Finally, we can retrieve a full schedule for trains to Tel Aviv (ID 5) that leave after 9 AM:
SELECT T.id, S.name, DT.departure_time
FROM RailStations S, DepartureTimes DT, Trains T,
  StationLines SL
WHERE T.id = DT.train_id
  AND T.line_id = SL.line_id
  AND SL.station_id = DT.station_id
  AND DT.station_id = S.id
  AND T.destination_id = 5
  AND T.depart_origin_time > "09:00"
ORDER BY T.id, SL.north_to_south
;

______________________

Webinar
One Click, Universal Protection: Implementing Centralized Security Policies on Linux Systems

As Linux continues to play an ever increasing role in corporate data centers and institutions, ensuring the integrity and protection of these systems must be a priority. With 60% of the world's websites and an increasing share of organization's mission-critical workloads running on Linux, failing to stop malware and other advanced threats on Linux can increasingly impact an organization's reputation and bottom line.

Learn More

Sponsored by Bit9

Webinar
Linux Backup and Recovery Webinar

Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.

In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.

Learn More

Sponsored by Storix