Relational databases are becoming increasingly popular for web applications. This is generally a good thing, allowing us to focus on the way in which our data is structured, rather than the way it is stored on disk. Offloading data storage and retrieval tasks to a relational database server means our programs can be smaller and easier to maintain.
However, incorporating a database server on a web site is not a cure-all. The database might take care of many necessary tasks, but it cannot design your tables for you, nor determine the best way in which to work with them.
This month, we will look at the art of database design and how we can structure tables to improve performance. Getting the most out of a database is something of a black art, which is why good database administrators (DBAs) are always in high demand. But with a few simple techniques, we can overcome some of the most basic performance problems experienced by web programmers. We will design a database that can handle train schedules. In so doing, we will touch on a number of issues database programmers face when deciding how to design tables in a database.
I love trains, and often take the train between Tel Aviv and Haifa when I must travel between those two cities. One day, after leafing through the small paper schedule that Israel's Rail Authority distributes, I realized the implementation of a computerized train schedule is not as obvious as it would appear at first.
Rail schedules often come in the form of a printed table, with the columns representing individual trains and each station in its own row. Each table lists trains on one rail line, in a single direction.
Since relational databases store all of their data in tables, you might think this is a perfect way in which to store the information. In order to allow us to add and delete trains more easily, we will swap the axes from the printed schedule, putting the individual trains in the rows and the stations in the columns.
To define such a table in SQL, we could use a query like this:
CREATE TABLE HaifaToTelAviv ( haifa_central TIME NOT NULL, haifa_bat_galim TIME NOT NULL, binyamina TIME NOT NULL, hof_hacarmel TIME NOT NULL, ta_central TIME NOT NULL, ta_hashalom TIME NOT NULL );
Given such a table, we could enter our trains as follows:
INSERT INTO HaifaToTelAviv (haifa_central, haifa_bat_galim, binyamina, hof_hacarmel, ta_central, ta_hashalom) VALUES ("12:05", "12:10", "12:17", "12:37", "13:16", "13:21");If you have any experience with databases, you can quickly see the terrible problems in store for us here. For starters, what happens if a new station is built between Haifa and Tel Aviv? That would require us to redefine our table, adding a new column, and that's only the beginning. It is a bit absurd that each train line requires two tables, one for each direction. And there isn't any way for me to determine whether a particular rail line serves any two cities—if the cities are represented by columns. What can we do about Tel Aviv? If two cities are close to each other and I can take a train to either one, I will have to query two tables in order to find the answer.
In addition, trying to query information from the above HaifaToTelAviv table would be difficult, requiring us to know the name of the column corresponding to each station. The problems just continue from there—for instance, what do we enter if the express train passes Binyamina? We could define the “binyamina” column to be NULL and enter a NULL value in that column. However, NULL normally indicates that a value is unknown or missing, whereas the reason in this case is much simpler.
Finally, what happens if a new schedule comes out, making each train later by a different amount of time? Editing the schedule in this format would be quite difficult.
How should we model the train schedule, then, if we cannot do so from the printed schedule? The solution is to break the information into smaller tables, bringing them together to answer questions. Relational databases specialize in this sort of operation, allowing us to “join” two or more tables together.
Breaking the single large table into many smaller tables makes the database more flexible, allowing us to ask many more questions than would otherwise be possible. For example, we should be able to ask questions like:
What is the last train from Haifa that will arrive in Tel Aviv before 11:00 a.m.?
Are there any express trains from Tel Aviv to Haifa?
What time will the 10:00 a.m. train from Binyamina get to Tel Aviv?
If we model our data correctly, breaking it down into sufficiently small and flexible tables, it should be possible to answer any of these questions with a single SQL query.
These examples all use MySQL, a “mostly free” database popular with many web sites. MySQL lacks some of the advanced features of other databases, such as transactions and referential integrity. However, it is easy to install and administer and is extremely fast. You can learn more about MySQL at http://www.mysql.com/.
For example, here is a definition of the RailStations table:
CREATE TABLE RailStations ( id TINYINT UNSIGNED AUTO_INCREMENT PRIMARY KEY, name VARCHAR(50) NOT NULL, UNIQUE(name) );
The only reason for RailStations to exist is to associate a numeric ID with each station. It might seem silly to create such a table, when we could enter station names directly wherever we need them.
However, giving each station an ID number gives us two advantages. First of all, we can be sure the station names will be spelled consistently, without variations in spelling, capitalization and abbreviations. Second, an integer consumes less space than the name to which it points. Each tinyint consumes a single byte, whereas a 20-character station name will consume 20 bytes. Referring to the full name would thus consume 20 times as much RAM and disk space.
Notice that we define id to be a column of type TINYINT UNSIGNED. This allows us to assign values between 0 and 255. Large rail systems, with more than 255 stations, would need to use a SMALLINT UNSIGNED, which ranges between 0 and 65535.
We ensure each station name in RailStations is unique by giving it the UNIQUE qualifier. The ID numbers are already guaranteed to be unique because they have been declared the primary key. Better yet, because we specified AUTO_INCREMENT, MySQL will automatically assign an ID number if an INSERT query ignores it. For example:
INSERT INTO RailStations (name) VALUES ("Nahariya");
If we now query the database:
SELECT id FROM RailStations WHERE name = "Nahariya";we learn that Nahariya has been automatically assigned an ID of 1.
We can insert one or more new rows into the table with a single INSERT statement. For example, the following adds several more rows to RailStations:
INSERT INTO RailStations (name) VALUES ("Akko"), ("Hof Hacarmel"), ("Tel Aviv Central"), ("Tel Aviv Hashalom"), ("Lod"), ("Rehovot"), ("Herzliya") ;
Fast/Flexible Linux OS Recovery
On Demand Now
In this live one-hour webinar, learn how to enhance your existing backup strategies for complete disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible full-system recovery solution for UNIX and Linux systems.
Join Linux Journal's Shawn Powers and David Huffman, President/CEO, Storix, Inc.
Free to Linux Journal readers.Register Now!
- Tips for Optimizing Linux Memory Usage
- Secure Desktops with Qubes: Introduction
- Download "Linux Management with Red Hat Satellite: Measuring Business Impact and ROI"
- Working with Command Arguments
- Fancy Tricks for Changing Numeric Base
- Secure Desktops with Qubes: Installation
- CentOS 6.8 Released
- Linux Mint 18
- Seeing Red and Getting Sleep
- The Italian Army Switches to LibreOffice
Until recently, IBM’s Power Platform was looked upon as being the system that hosted IBM’s flavor of UNIX and proprietary operating system called IBM i. These servers often are found in medium-size businesses running ERP, CRM and financials for on-premise customers. By enabling the Power platform to run the Linux OS, IBM now has positioned Power to be the platform of choice for those already running Linux that are facing scalability issues, especially customers looking at analytics, big data or cloud computing.
￼Running Linux on IBM’s Power hardware offers some obvious benefits, including improved processing speed and memory bandwidth, inherent security, and simpler deployment and management. But if you look beyond the impressive architecture, you’ll also find an open ecosystem that has given rise to a strong, innovative community, as well as an inventory of system and network management applications that really help leverage the benefits offered by running Linux on Power.Get the Guide