Open-Source Databases, Part III: Choosing a Database
If you are an application developer, you're probably working with large quantities of data. And, if that data is anything more complex than a hash table, you might want to consider moving some or all of it into a relational database. Relational databases are designed for reliable and flexible retrieval of data. The magic of a relational database is not the use of two-dimensional tables to store all of the information, but it's the fact that tables can be combined in many different ways and manipulated using the SQL query language.
As we saw in my database articles in the last two issues of LJ, open-source programmers are fortunate enough to have several database options at their disposal. By far, the two most popular open-source relational databases are MySQL and PostgreSQL. Each has a large and loyal following, and each continues to improve with every successive version.
And, when I write “large and loyal following” above, I'm not kidding. MySQL and PostgreSQL have long been at the center of a major flame war within the Open Source world. If someone on Slashdot dares say something about one of these products, you can be sure it won't be long before someone writes a nasty (and often childish) note about the other one. These disagreements often reflect the knee-jerk attitudes of uninformed users, but there have been no shortage of attacks from well-known and informed users of these products as well.
I believe there are circumstances when either MySQL or PostgreSQL might be an appropriate choice. I've strongly preferred PostgreSQL in my work during the last decade—yet, there definitely are times when MySQL seems to be the more appropriate solution.
So, despite my personal biases and the risk of opening a flame war within the Open Source community, I now conclude this series about open-source databases with a comparison between MySQL and PostgreSQL in a number of different categories. I hope by the time you finish reading this article, you understand that choosing a database is almost never a matter of finding the “fastest” or “best” product, because there is no one way to measure the quality or appropriateness of a relational database server. Rather, I hope you'll be able to consider each of these on the basis of its own merits, rather than on the propaganda that is so widespread.
Perhaps the first and foremost task of a database is to store and retrieve data reliably. Just as you wouldn't want to use a hard disk that occasionally loses data, you don't want to put things into a database that occasionally mangles its contents. This is true even if the reliability comes at the expense of speed.
The gold standard for reliability in the database world has an acronym, ACID (Atomicity, Consistency, Isolation and Durability). This means that under all circumstances in the database, the following hold true:
Atomicity: each query is guaranteed to complete or not, without any possibility of halfway or incomplete states.
Consistency: the database is always in a legal state before and after a transaction.
Isolation: each transaction occurs separately from other actions, so that you can't have two transactions interfering with one another.
Durability: transactions persist over time, typically by being stored on a filesystem.
The attitude toward ACID within the PostgreSQL community has been unchanged since I first started to use it a decade ago, placing it as the highest possible priority. This doesn't mean PostgreSQL is lacking in other features, but rather it means the developers have worked to ensure that data stored in a PostgreSQL system will be consistent and reliable, even if you do nasty things such as issue a kill -9 or pull the plug.
During the past few years, PostgreSQL has begun to offer even better support for transactions and database stability, using write-ahead logs (WALs) that describe each action taken by the database. These WAL files can be used to recover from a disaster or even to recover the database to an earlier point in its history—a feature known as point-in-time recovery (PITR). Thus, if you know something happened yesterday, but the database was working perfectly two days ago, you could use PITR to recover to the earlier, stable state. Recent versions of PostgreSQL also support two-phased commit, a type of transaction you're likely to see in a distributed system where multiple servers must coordinate their actions.
MySQL has had a mixed attitude toward ACID during the years. When I first started to use MySQL in 1995, the authors' attitude was that transactions should be handled by the application, not the database. Indeed, as recently as 2000, the to-do list for MySQL included tasks having to do with production-quality transaction-safe tables. This has led to a great deal of bad blood between the MySQL and PostgreSQL communities, with members of the latter sometimes claiming that no critical data should ever be stored in MySQL.
The good news is that modern versions of MySQL do indeed support transaction-safe tables, using InnoDB, a third-party product released under the GPL that has been integrated into MySQL for several years. Moreover, InnoDB appears to use techniques that PostgreSQL and Oracle have used for years, such as MVCC (multi-version concurrency control). The bad news is that at least some benchmarks I've seen indicate that InnoDB has some problems scaling to large numbers of simultaneous queries.
In addition, the company that develops InnoDB recently was bought by Oracle, which might lead some people to worry about future licensing, development and pricing issues. For the time being, this latter issue does not appear to be a serious one, because Oracle and MySQL signed a contract in 2006 extending the licensing for InnoDB. But, MySQL does not appear to be taking any chances and has hired several experts to create a new table structure that will be owned by MySQL and thus be impervious to such business problems.
I'm personally of the persuasion that true ACID compliance is always a good thing to have around, much like seat belts in a car. Sure, you can drive without a seat belt, and the odds are that nothing will happen to you. But, it's impossible to predict when something bad might happen, and you really don't want to be without a seat belt under such circumstances. In the same way, if your data is important to you, it's best to ensure that it will persist with integrity.
A related problem has to do with the degree to which each database enforces constraints and limits. PostgreSQL tends to be quite stringent on such matters, refusing to accept illegal data. MySQL tries to be more forgiving and flexible, but that can result in strange and illegal data being stored.
For example, consider the following set of MySQL commands, in which we create a table foo with a single column (named a) of type DATE:
mysql> CREATE TABLE foo (a date); Query OK, 0 rows affected (0.08 sec) mysql> INSERT INTO foo (a) VALUES ('2007-feb-30'); Query OK, 1 row affected, 1 warning (0.00 sec) mysql> SELECT * FROM foo; +------------+ | a | +------------+ | 0000-00-00 | +------------+ 1 row in set (0.00 sec)
By contrast, this is what happens in PostgreSQL:
atf=# CREATE TABLE foo (a date); CREATE TABLE atf=# \d foo Table "public.foo" Column | Type | Modifiers --------+------+----------- a | date | atf=# INSERT INTO foo (a) VALUES ('2007-feb-30'); ERROR: date/time field value out of range: "2007-feb-30"
It is possible to configure MySQL to be more strict on such issues, but most users will not think to do so and will be stuck with illegal values in their tables.
Given the political and technical issues at MySQL, as well as the weird (and potentially dangerous) default behavior in MySQL, I believe that PostgreSQL has a big edge on issues of data integrity.
- Geek Guide: The DevOps Toolbox
- Nmap—Not Just for Evil!
- Download "The DevOps Toolbox: Tools and Technologies for Scale and Reliability"
- High-Availability Storage with HA-LVM
- Resurrecting the Armadillo
- March 2015 Issue of Linux Journal: System Administration
- Real-Time Rogue Wireless Access Point Detection with the Raspberry Pi
- DNSMasq, the Pint-Sized Super Dæmon!
- Localhost DNS Cache
- Days Between Dates: the Counting