Relational databases are becoming increasingly popular for web applications. This is generally a good thing, allowing us to focus on the way in which our data is structured, rather than the way it is stored on disk. Offloading data storage and retrieval tasks to a relational database server means our programs can be smaller and easier to maintain.
However, incorporating a database server on a web site is not a cure-all. The database might take care of many necessary tasks, but it cannot design your tables for you, nor determine the best way in which to work with them.
This month, we will look at the art of database design and how we can structure tables to improve performance. Getting the most out of a database is something of a black art, which is why good database administrators (DBAs) are always in high demand. But with a few simple techniques, we can overcome some of the most basic performance problems experienced by web programmers. We will design a database that can handle train schedules. In so doing, we will touch on a number of issues database programmers face when deciding how to design tables in a database.
I love trains, and often take the train between Tel Aviv and Haifa when I must travel between those two cities. One day, after leafing through the small paper schedule that Israel's Rail Authority distributes, I realized the implementation of a computerized train schedule is not as obvious as it would appear at first.
Rail schedules often come in the form of a printed table, with the columns representing individual trains and each station in its own row. Each table lists trains on one rail line, in a single direction.
Since relational databases store all of their data in tables, you might think this is a perfect way in which to store the information. In order to allow us to add and delete trains more easily, we will swap the axes from the printed schedule, putting the individual trains in the rows and the stations in the columns.
To define such a table in SQL, we could use a query like this:
CREATE TABLE HaifaToTelAviv ( haifa_central TIME NOT NULL, haifa_bat_galim TIME NOT NULL, binyamina TIME NOT NULL, hof_hacarmel TIME NOT NULL, ta_central TIME NOT NULL, ta_hashalom TIME NOT NULL );
Given such a table, we could enter our trains as follows:
INSERT INTO HaifaToTelAviv (haifa_central, haifa_bat_galim, binyamina, hof_hacarmel, ta_central, ta_hashalom) VALUES ("12:05", "12:10", "12:17", "12:37", "13:16", "13:21");If you have any experience with databases, you can quickly see the terrible problems in store for us here. For starters, what happens if a new station is built between Haifa and Tel Aviv? That would require us to redefine our table, adding a new column, and that's only the beginning. It is a bit absurd that each train line requires two tables, one for each direction. And there isn't any way for me to determine whether a particular rail line serves any two cities—if the cities are represented by columns. What can we do about Tel Aviv? If two cities are close to each other and I can take a train to either one, I will have to query two tables in order to find the answer.
In addition, trying to query information from the above HaifaToTelAviv table would be difficult, requiring us to know the name of the column corresponding to each station. The problems just continue from there—for instance, what do we enter if the express train passes Binyamina? We could define the “binyamina” column to be NULL and enter a NULL value in that column. However, NULL normally indicates that a value is unknown or missing, whereas the reason in this case is much simpler.
Finally, what happens if a new schedule comes out, making each train later by a different amount of time? Editing the schedule in this format would be quite difficult.
How should we model the train schedule, then, if we cannot do so from the printed schedule? The solution is to break the information into smaller tables, bringing them together to answer questions. Relational databases specialize in this sort of operation, allowing us to “join” two or more tables together.
Breaking the single large table into many smaller tables makes the database more flexible, allowing us to ask many more questions than would otherwise be possible. For example, we should be able to ask questions like:
What is the last train from Haifa that will arrive in Tel Aviv before 11:00 a.m.?
Are there any express trains from Tel Aviv to Haifa?
What time will the 10:00 a.m. train from Binyamina get to Tel Aviv?
If we model our data correctly, breaking it down into sufficiently small and flexible tables, it should be possible to answer any of these questions with a single SQL query.
These examples all use MySQL, a “mostly free” database popular with many web sites. MySQL lacks some of the advanced features of other databases, such as transactions and referential integrity. However, it is easy to install and administer and is extremely fast. You can learn more about MySQL at http://www.mysql.com/.
For example, here is a definition of the RailStations table:
CREATE TABLE RailStations ( id TINYINT UNSIGNED AUTO_INCREMENT PRIMARY KEY, name VARCHAR(50) NOT NULL, UNIQUE(name) );
The only reason for RailStations to exist is to associate a numeric ID with each station. It might seem silly to create such a table, when we could enter station names directly wherever we need them.
However, giving each station an ID number gives us two advantages. First of all, we can be sure the station names will be spelled consistently, without variations in spelling, capitalization and abbreviations. Second, an integer consumes less space than the name to which it points. Each tinyint consumes a single byte, whereas a 20-character station name will consume 20 bytes. Referring to the full name would thus consume 20 times as much RAM and disk space.
Notice that we define id to be a column of type TINYINT UNSIGNED. This allows us to assign values between 0 and 255. Large rail systems, with more than 255 stations, would need to use a SMALLINT UNSIGNED, which ranges between 0 and 65535.
We ensure each station name in RailStations is unique by giving it the UNIQUE qualifier. The ID numbers are already guaranteed to be unique because they have been declared the primary key. Better yet, because we specified AUTO_INCREMENT, MySQL will automatically assign an ID number if an INSERT query ignores it. For example:
INSERT INTO RailStations (name) VALUES ("Nahariya");
If we now query the database:
SELECT id FROM RailStations WHERE name = "Nahariya";we learn that Nahariya has been automatically assigned an ID of 1.
We can insert one or more new rows into the table with a single INSERT statement. For example, the following adds several more rows to RailStations:
INSERT INTO RailStations (name) VALUES ("Akko"), ("Hof Hacarmel"), ("Tel Aviv Central"), ("Tel Aviv Hashalom"), ("Lod"), ("Rehovot"), ("Herzliya") ;
|Dynamic DNS—an Object Lesson in Problem Solving||May 21, 2013|
|Using Salt Stack and Vagrant for Drupal Development||May 20, 2013|
|Making Linux and Android Get Along (It's Not as Hard as It Sounds)||May 16, 2013|
|Drupal Is a Framework: Why Everyone Needs to Understand This||May 15, 2013|
|Home, My Backup Data Center||May 13, 2013|
|Non-Linux FOSS: Seashore||May 10, 2013|
- Dynamic DNS—an Object Lesson in Problem Solving
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Using Salt Stack and Vagrant for Drupal Development
- New Products
- Drupal Is a Framework: Why Everyone Needs to Understand This
- Validate an E-Mail Address with PHP, the Right Way
- A Topic for Discussion - Open Source Feature-Richness?
- New Products
- Download the Free Red Hat White Paper "Using an Open Source Framework to Catch the Bad Guy"
- The Secret Password Is...
3 hours 39 min ago
- Keeping track of IP address
5 hours 30 min ago
- Roll your own dynamic dns
10 hours 43 min ago
- Please correct the URL for Salt Stack's web site
13 hours 55 min ago
- Android is Linux -- why no better inter-operation
16 hours 10 min ago
- Connecting Android device to desktop Linux via USB
16 hours 39 min ago
- Find new cell phone and tablet pc
17 hours 37 min ago
19 hours 6 min ago
- Automatically updating Guest Additions
20 hours 14 min ago
- I like your topic on android
21 hours 1 min ago
Enter to Win an Adafruit Pi Cobbler Breakout Kit for Raspberry Pi
It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Pi Cobbler Breakout Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- 5-21-13, Prototyping Pi Plate Kit: Philip Kirby
- Next winner announced on 5-27-13!
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?