Designing Databases
In addition to the column definitions and the UNIQUE qualifier, our definition for the StationLines table included three INDEX lines—one for each of the station_id, line_id and north_to_south columns.
While it often helps to think of a relational database table as a glorified spreadsheet with rows and columns, there are some important differences. One is that a database table does not store its rows in any particular order. If we are interested in retrieving rows from the table in a certain order, we must specify it with the ORDER BY clause in our query.
Because rows are not ordered in any particular way, a SELECT query can often take quite a while to fulfill. For example, take the following query:
SELECT id FROM RailStations WHERE name = "Tel Aviv Central";
This might not seem like a time-consuming query, given that it involves a single table and a simple WHERE clause. But since the rows of RailStations are not stored in any particular order, finding the rows where the name is “Tel Aviv Central” can take quite a while. This might be a negligible amount of time in the case of a 100-row table, but when a table contains 1,000 or 10,000 rows, the time can become noticeable. In this particular example, the database server is probably smart enough to realize that RailStations.name has been declared UNIQUE, meaning our query will return one row, if it returns anything. This means the server will, on average, have to search through only half of the rows—but that can still take quite a while.
An index changes this picture by adding a pointer to each column value. If RailStations.name is indexed, the MySQL server can almost immediately find those rows containing a particular value. It can also determine whether a value exists at all.
If indexes can increase query speeds so dramatically, why are rows unindexed by default? The main answer is that indexes are written and updated each time an INSERT or UPDATE operation is performed on a table. Since the majority of database queries are SELECTs, in which the index can substantially improve performance, this is normally an acceptable trade-off. However, certain applications must INSERT and UPDATE at maximum speed, in which case creating an index can cause problems.
Since indexes are used in locating columns of a certain value, they are necessary for only those columns that will be named in WHERE clauses. There is no need to index a column that is displayed, but rarely used as a search criterion.
In some cases, it is enough to index the first part of each column rather than the entire column. For example, if we are indexing a column of type VARCHAR(50), then we might be able to index only 10 of those characters. This will retain most of the advantages of a full index (since the first ten characters are rarely identical in such a text field), while reducing the amount of information the index must store.
Now that we have thoroughly examined the tables describing the train system, it is time to put some trains on those tracks. The question of how to model this data is a tough one, since there are a number of ways in which to accomplish it. I decided to split this information into two tables, Trains and DepartureTimes.
Each row of Trains describes a particular train, indicating the line on which it runs, the ID numbers of its origin and destination stations, and the time it departs from its origin:
CREATE TABLE Trains (
id SMALLINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
line_id TINYINT UNSIGNED NOT NULL,
origin_id TINYINT UNSIGNED NOT NULL,
destination_id TINYINT UNSIGNED NOT NULL,
depart_origin_time TIME NOT NULL,
UNIQUE(line_id, origin_id, destination_id,
depart_origin_time),
INDEX(line_id),
INDEX(origin_id),
INDEX(destination_id),
INDEX(depart_origin_time)
);
The first column is a primary key, allowing us to describe each train with a single number. The combination of a rail line, origin, destination and hour should be unique, so we ask the database server to enforce this condition with the UNIQUE keyword.
Finally, we define the DepartureTimes table, which stores information on when a train will leave from a particular station:
CREATE TABLE DepartureTimes ( train_id SMALLINT UNSIGNED NOT NULL, station_id TINYINT UNSIGNED NOT NULL, departure_time TIME NOT NULL, INDEX(train_id), INDEX(station_id), INDEX(departure_time) );
Once we enter information into these tables, we can start to perform sophisticated queries. For example, which trains arrive at “Tel Aviv Central” before 8 a.m.?
SELECT train_id FROM DepartureTimes WHERE departure_time < "08:00" AND station_id = 4;Sure enough, this query returns a table containing two rows:
+----------+ | train_id | +----------+ | 1 | | 2 | +----------+Now we know two trains will arrive in Tel Aviv early enough for us to catch a morning meeting. But which trains are those? It would be nice to get more information than that. One possibility is to print the name of the origin station and the hour at which the train leaves:
SELECT S.name, T.depart_origin_time FROM DepartureTimes DT, Trains T, RailStations S WHERE DT.departure_time < "08:00" AND DT.station_id = 4 AND DT.train_id = T.id AND S.id = T.origin_id;Notice how SQL allows us to use < and > when handling dates and times, for columns declared as DATE, TIME or DATETIME. Given the contortions one must use in order to compare dates and times in nearly any programming language, this built-in date comparison is still one of my favorites.
Assuming we want to take the first train of the day (ID 1), we can print the list of when it will arrive at each station:
SELECT T.id, S.name, DT.departure_time FROM RailStations S, DepartureTimes DT, Trains T, StationLines SL WHERE T.id = DT.train_id AND T.id = 1 AND T.line_id = SL.line_id AND SL.station_id = DT.station_id AND DT.station_id = S.id ORDER BY T.id, SL.north_to_south ;
We can even print a full schedule for trains to Tel Aviv (ID 5):
SELECT T.id, S.name, DT.departure_time FROM RailStations S, DepartureTimes DT, Trains T, StationLines SL WHERE T.id = DT.train_id AND T.line_id = SL.line_id AND SL.station_id = DT.station_id AND DT.station_id = S.id AND T.destination_id = 5 ORDER BY T.id, SL.north_to_south ;Finally, we can retrieve a full schedule for trains to Tel Aviv (ID 5) that leave after 9 AM:
SELECT T.id, S.name, DT.departure_time FROM RailStations S, DepartureTimes DT, Trains T, StationLines SL WHERE T.id = DT.train_id AND T.line_id = SL.line_id AND SL.station_id = DT.station_id AND DT.station_id = S.id AND T.destination_id = 5 AND T.depart_origin_time > "09:00" ORDER BY T.id, SL.north_to_south ;
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Dynamic DNS—an Object Lesson in Problem Solving | May 21, 2013 |
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
| Non-Linux FOSS: Seashore | May 10, 2013 |
- Dynamic DNS—an Object Lesson in Problem Solving
- Using Salt Stack and Vagrant for Drupal Development
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- New Products
- RSS Feeds
- Dart: a New Web Programming Experience
- Resources for “Analysis of the HTB Queuing Discipline”
- Parallel Programming with NVIDIA CUDA
- Developer Poll
- Pluggable Authentication Modules for Linux
Enter to Win an Adafruit Pi Cobbler Breakout Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Pi Cobbler Breakout Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- 5-21-13, Prototyping Pi Plate Kit: Philip Kirby
- Next winner announced on 5-27-13!
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?




1 hour 14 min ago
3 hours 5 min ago
8 hours 18 min ago
11 hours 29 min ago
13 hours 45 min ago
14 hours 13 min ago
15 hours 11 min ago
16 hours 40 min ago
17 hours 49 min ago
18 hours 35 min ago