Relational databases are becoming increasingly popular for web applications. This is generally a good thing, allowing us to focus on the way in which our data is structured, rather than the way it is stored on disk. Offloading data storage and retrieval tasks to a relational database server means our programs can be smaller and easier to maintain.
However, incorporating a database server into a web site is not a cure-all. The database can take care of many necessary tasks, but it cannot design your tables for you, nor determine the best way to work with them.
This month, we will look at the art of database design and how we can structure tables to improve performance. Getting the most out of a database is something of a black art, which is why good database administrators (DBAs) are always in high demand. But with a few simple techniques, we can overcome some of the most basic performance problems experienced by web programmers. We will design a database that can handle train schedules. In so doing, we will touch on a number of issues database programmers face when deciding how to design tables in a database.
I love trains, and often take the train between Tel Aviv and Haifa when I must travel between those two cities. One day, after leafing through the small paper schedule that Israel's Rail Authority distributes, I realized the implementation of a computerized train schedule is not as obvious as it would appear at first.
Rail schedules often come in the form of a printed table, with the columns representing individual trains and the rows representing stations. Each table lists the trains on one rail line, in a single direction.
Since relational databases store all of their data in tables, you might think this is a perfect way in which to store the information. In order to allow us to add and delete trains more easily, we will swap the axes from the printed schedule, putting the individual trains in the rows and the stations in the columns.
To define such a table in SQL, we could use a query like this:
CREATE TABLE HaifaToTelAviv (
    haifa_central    TIME NOT NULL,
    haifa_bat_galim  TIME NOT NULL,
    binyamina        TIME NOT NULL,
    hof_hacarmel     TIME NOT NULL,
    ta_central       TIME NOT NULL,
    ta_hashalom      TIME NOT NULL
);
Given such a table, we could enter our trains as follows:
INSERT INTO HaifaToTelAviv
    (haifa_central, haifa_bat_galim, binyamina,
     hof_hacarmel, ta_central, ta_hashalom)
    VALUES ("12:05", "12:10", "12:17",
            "12:37", "13:16", "13:21");

If you have any experience with databases, you can quickly see the terrible problems in store for us here. For starters, what happens if a new station is built between Haifa and Tel Aviv? That would require us to redefine our table, adding a new column, and that's only the beginning. It is a bit absurd that each train line requires two tables, one for each direction. Nor is there any way to determine whether a particular rail line serves two given cities, when the cities are represented by columns. And what can we do about Tel Aviv? If two cities are close to each other and I can take a train to either one, I will have to query two tables in order to find the answer.
In addition, trying to retrieve information from the above HaifaToTelAviv table would be difficult, requiring us to know the name of the column corresponding to each station. The problems continue from there. For instance, what do we enter if an express train skips Binyamina? We could allow the “binyamina” column to be NULL and enter a NULL value for such trains. However, NULL normally indicates that a value is unknown or missing, whereas here the value is known perfectly well: the train simply does not stop at that station.
Finally, what happens if a new schedule comes out, making each train later by a different amount of time? Editing the schedule in this format would be quite difficult.
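To make that pain concrete, here is a small sketch, using Python's built-in sqlite3 module as a stand-in for MySQL (SQLite's time() function plays the role of MySQL's ADDTIME). Notice that shifting a single train means rewriting every column of its row, and that because the wide table has no key, the only way to single the train out is to match on one of its time values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The wide table from above, rebuilt in SQLite.
cur.execute("""
    CREATE TABLE HaifaToTelAviv (
        haifa_central   TIME NOT NULL,
        haifa_bat_galim TIME NOT NULL,
        binyamina       TIME NOT NULL,
        hof_hacarmel    TIME NOT NULL,
        ta_central      TIME NOT NULL,
        ta_hashalom     TIME NOT NULL
    )
""")
cur.execute("""
    INSERT INTO HaifaToTelAviv
    VALUES ('12:05', '12:10', '12:17', '12:37', '13:16', '13:21')
""")

# Make just this one train five minutes later.  Every column must
# be touched, and the WHERE clause has to identify the train by a
# time value, since there is no proper key.  (SQLite's time()
# returns HH:MM:SS strings.)
cur.execute("""
    UPDATE HaifaToTelAviv
       SET haifa_central   = time(haifa_central,   '+5 minutes'),
           haifa_bat_galim = time(haifa_bat_galim, '+5 minutes'),
           binyamina       = time(binyamina,       '+5 minutes'),
           hof_hacarmel    = time(hof_hacarmel,    '+5 minutes'),
           ta_central      = time(ta_central,      '+5 minutes'),
           ta_hashalom     = time(ta_hashalom,     '+5 minutes')
     WHERE haifa_central = '12:05'
""")

cur.execute("SELECT haifa_central, ta_hashalom FROM HaifaToTelAviv")
print(cur.fetchone())   # ('12:10:00', '13:26:00')
```

And that is for a uniform five-minute shift; if each train moves by a different amount, each row needs its own hand-written UPDATE.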
How should we model the train schedule, then, if we cannot do so from the printed schedule? The solution is to break the information into smaller tables, bringing them together to answer questions. Relational databases specialize in this sort of operation, allowing us to “join” two or more tables together.
Breaking the single large table into many smaller tables makes the database more flexible, allowing us to ask many more questions than would otherwise be possible. For example, we should be able to ask questions like:
What is the last train from Haifa that will arrive in Tel Aviv before 11:00 a.m.?
Are there any express trains from Tel Aviv to Haifa?
What time will the 10:00 a.m. train from Binyamina get to Tel Aviv?
If we model our data correctly, breaking it down into sufficiently small and flexible tables, it should be possible to answer any of these questions with a single SQL query.
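Here is one possible decomposition, again sketched with sqlite3 standing in for MySQL. The Stops table and its column names are illustrative assumptions, not the article's final design: each row records when one train calls at one station, and a self-join of Stops answers the first question above in a single query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Stations get numeric IDs; Stops links a train to a station and
# a time.  One row per (train, station) pair replaces the wide,
# one-column-per-station layout.
cur.executescript("""
    CREATE TABLE RailStations (
        id   INTEGER PRIMARY KEY,
        name VARCHAR(50) NOT NULL UNIQUE
    );
    CREATE TABLE Stops (
        train_id   INTEGER NOT NULL,
        station_id INTEGER NOT NULL REFERENCES RailStations(id),
        stop_time  TIME NOT NULL
    );
""")

cur.executemany("INSERT INTO RailStations (name) VALUES (?)",
                [("Haifa Central",), ("Tel Aviv Central",)])

# Train 1 leaves Haifa at 09:10, arriving at 10:20; train 2
# leaves at 10:05 but arrives at 11:15, too late for us.
# (Train IDs are assigned by hand here for clarity.)
cur.executemany("INSERT INTO Stops VALUES (?, ?, ?)",
                [(1, 1, "09:10"), (1, 2, "10:20"),
                 (2, 1, "10:05"), (2, 2, "11:15")])

# The last train from Haifa Central arriving at Tel Aviv Central
# before 11:00: join each train's departure row to its arrival row.
cur.execute("""
    SELECT d.train_id, d.stop_time
      FROM Stops d
      JOIN RailStations ds  ON d.station_id = ds.id
      JOIN Stops a          ON a.train_id   = d.train_id
      JOIN RailStations ars ON a.station_id = ars.id
     WHERE ds.name  = 'Haifa Central'
       AND ars.name = 'Tel Aviv Central'
       AND a.stop_time < '11:00'
     ORDER BY d.stop_time DESC
     LIMIT 1
""")
print(cur.fetchone())   # (1, '09:10')
```

Adding a new station, or a new train, is now just an INSERT; no table ever needs to be redefined.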
These examples all use MySQL, a “mostly free” database popular with many web sites. MySQL lacks some of the advanced features of other databases, such as transactions and referential integrity. However, it is easy to install and administer and is extremely fast. You can learn more about MySQL at http://www.mysql.com/.
For example, here is a definition of the RailStations table:
CREATE TABLE RailStations (
    id    TINYINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name  VARCHAR(50) NOT NULL,
    UNIQUE(name)
);
The only reason for RailStations to exist is to associate a numeric ID with each station. It might seem silly to create such a table, when we could enter station names directly wherever we need them.
However, giving each station an ID number gives us two advantages. First of all, we can be sure the station names will be spelled consistently, without variations in spelling, capitalization and abbreviations. Second, an integer consumes less space than the name to which it points. Each tinyint consumes a single byte, whereas a 20-character station name will consume 20 bytes. Referring to the full name would thus consume 20 times as much RAM and disk space.
Notice that we define id to be a column of type TINYINT UNSIGNED. This allows us to assign values between 0 and 255. Large rail systems, with more than 255 stations, would need to use a SMALLINT UNSIGNED, which ranges between 0 and 65535.
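The ranges follow directly from the column sizes: one unsigned byte holds 2^8 values and two bytes hold 2^16, as a quick check confirms:

```python
# An unsigned 1-byte integer (MySQL TINYINT UNSIGNED) tops out
# at 2^8 - 1, and an unsigned 2-byte integer (SMALLINT UNSIGNED)
# at 2^16 - 1.
print(2**8 - 1)    # 255
print(2**16 - 1)   # 65535
```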
We ensure each station name in RailStations is unique by giving the name column the UNIQUE qualifier. The ID numbers are already guaranteed to be unique, because they have been declared the primary key. Better yet, because we specified AUTO_INCREMENT, MySQL will automatically assign an ID number to any INSERT that omits one. For example:
INSERT INTO RailStations (name) VALUES ("Nahariya");
If we now query the database:
SELECT id FROM RailStations WHERE name = "Nahariya";

we learn that Nahariya has been automatically assigned an ID of 1.
We can insert one or more new rows into the table with a single INSERT statement. For example, the following adds several more rows to RailStations:
INSERT INTO RailStations (name)
    VALUES ("Akko"), ("Hof Hacarmel"),
           ("Tel Aviv Central"), ("Tel Aviv Hashalom"),
           ("Lod"), ("Rehovot"), ("Herzliya");
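The whole sequence can be tried out with Python's built-in sqlite3 module, used here as a stand-in for MySQL; SQLite's INTEGER PRIMARY KEY assigns IDs automatically, much as AUTO_INCREMENT does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# SQLite version of RailStations.  INTEGER PRIMARY KEY plays the
# role of MySQL's AUTO_INCREMENT primary key.
cur.execute("""
    CREATE TABLE RailStations (
        id   INTEGER PRIMARY KEY,
        name VARCHAR(50) NOT NULL UNIQUE
    )
""")

# Omitting the id column lets the database assign it for us.
cur.execute("INSERT INTO RailStations (name) VALUES ('Nahariya')")

# Several rows can be added with one statement.
cur.executemany("INSERT INTO RailStations (name) VALUES (?)",
                [("Akko",), ("Hof Hacarmel",), ("Tel Aviv Central",),
                 ("Tel Aviv Hashalom",), ("Lod",), ("Rehovot",),
                 ("Herzliya",)])

cur.execute("SELECT id FROM RailStations WHERE name = 'Nahariya'")
print(cur.fetchone()[0])   # Nahariya was assigned ID 1
```

A second INSERT of "Nahariya" would fail the UNIQUE constraint, which is exactly the spelling-consistency guarantee we wanted from this table.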