At the Forge - PostgreSQL 9.0
Until just a few years ago, anyone using the term database almost always was referring to a relational database—that is, one based on linked two-dimensional tables, which could be queried using the relatively standard SQL language. Sure, there were a variety of other types of databases, but they weren't in widespread use or taken very seriously.
In the past few years, however, there has been tremendous growth in the use and mindshare of so-called NoSQL databases. There's not much in common between these various databases—from Redis to MongoDB to CouchDB to Hadoop—that offer the promise of greater scalability and flexibility.
Now, I'm not one to say these non-relational databases are bad. On the contrary, I've used several of them in my own projects and found them to be a perfect fit for certain needs. But for day-to-day use, I continue to find relational databases an excellent fit. And when I need a relational database, I always prefer to use PostgreSQL. Indeed, when I'm working on a Web application, PostgreSQL is the default option. Only if I see that it won't work well (for whatever reason) do I switch some or all of the project to use a different database.
Why do I love PostgreSQL so much? First, because it's rock solid. The developers have an almost obsessive dedication to the safety of the data you store in the database, and they do their best to ensure that data won't ever be deleted or corrupted. This doesn't mean such situations never happen, but they're quite rare. PostgreSQL not only tries to ensure the safety of your data, but it also provides you with a variety of tools to help you constrain the values that may be stored in the database, ensuring that anything stored is guaranteed to be valid.
Second, PostgreSQL offers a wealth of features that never cease to impress and amaze me. Whether it's the number of languages in which you can write server-side functions or the different ways you can define indexes, the clever use of MVCC (multi-version concurrency control) to avoid writers blocking readers or the ongoing statistical analysis that feeds into the query optimizer, PostgreSQL hasn't failed me in more than 15 years of day-to-day use.
Every time I use another relational database, I find myself wishing for one or more functions that PostgreSQL provides or thinking that PostgreSQL wouldn't even enter into certain problematic situations, because of its high-quality architecture. This doesn't mean PostgreSQL is perfect, but its mix of features has served me quite well, and I'm not the only one to feel this way.
This month, I want to look at PostgreSQL 9.0, which was released in late 2010. In particular, I want to consider what advantages it offers over other open-source databases (mainly MySQL). For people already using PostgreSQL, what new features does the 9.0 release bring to the table?
I won't describe installing PostgreSQL, simply because versions are available for all major Linux distributions. Downloading and compiling the source code is relatively straightforward—although doing so as the dedicated “postgres” user, rather than as root, will save you many headaches during the installation process—but I have found binaries to be just fine for my day-to-day needs.
PostgreSQL is an open-source (BSD-licensed) relational database developed by a core group of developers working for a number of different companies. (Indeed, the rules of the core development group forbid more than a handful of core developers from working for the same company to ensure that no one organization controls development directly.) It grew out of the Postgres Project developed by Michael Stonebreaker at the University of California, Berkeley. Postgres itself was an attempt to improve upon the original Ingres database Stonebreaker had developed and commercialized.
Postgres used its own query language, known as QUEL, and had a number of advanced ideas, including many taken from the world of object-oriented programming. Several developers joined forces to take the Postgres database, replace QUEL with SQL, stabilize the code and release it under an open-source license. The first release of PostgreSQL, under its new name, came in 1996. Informally, a large number of PostgreSQL users and developers continue to call it Postgres, although that technically refers to Stonebreaker's project at Berkeley and not the current implementation.
One of the main goals of the PostgreSQL team has been to adhere to the SQL standard wherever possible. In addition, as I mentioned previously, PostgreSQL developers pride themselves on a rock-solid implementation, using the standard ACID (atomicity-consistency-isolation-durability) paradigm for database storage under all circumstances. They further try to balance between a powerful set of features and configurable options and overwhelming people who don't want to be full-time database administrators.
All PostgreSQL storage is transactional, using a system known as MVCC (multi-version concurrency control). MVCC, which also is used in Oracle and CouchDB, largely avoids conflicts between readers and writers, ensuring that neither rows nor tables need to be locked under most circumstances. MVCC does have the side effect of keeping “dead rows” around in the database, which occasionally need to be “vacuumed”, similar to garbage collection in many programming languages. For many years, vacuuming needed to be done manually, typically by running the vacuum command from a cron job. Nowadays, the autovacuum dæmon runs in the background, marking space as reusable when added or updated rows reach a configurable threshold.
Vacuum also can be run in “analyze” mode, in which case it examines tables and indexes, updating the statistics that are used when planning queries. This is one of the reasons PostgreSQL is able to operate so efficiently, even when handed complex queries. By keeping an up-to-date description of data stored in each table, it can make an intelligent decision regarding how to optimize the query. If the guesses turn out to be wrong for your particular needs, you can configure some configuration parameters.
PostgreSQL offers strong support for foreign keys, making it possible to ensure that connections between tables exist and work in the right way. That is, you can define a People table and then an Addresses table that refers to it:
CREATE TABLE People ( id SERIAL, first_name TEXT, last_name TEXT, email_address TEXT, PRIMARY KEY(id) ); CREATE TABLE Addresses ( id SERIAL, person_id INTEGER REFERENCES People, address TEXT, PRIMARY KEY(id) );
Now, let's try to insert a new record into the Addresses table, without inserting anything into People first:
INSERT INTO Addresses (person_id) VALUES (5);
This results in the following:
ERROR: insert or update on table "addresses" violates foreign key ↪constraint "addresses_person_id_fkey" DETAIL: Key (person_id)=(5) is not present in table "people".
Because I have defined person_id as a foreign key, PostgreSQL won't let me have an invalid value there, period. PostgreSQL also will refuse to let you remove a record from the People table if there are rows in the Addresses table referring to it. You can override such restrictions with the CASCADE keyword, but the database never will remove things on its own without your explicit approval.
You can, of course, also ensure that e-mail addresses are unique in your People table:
CREATE UNIQUE INDEX people_email_idx ON People(email_address);
But wait, what if someone enters an e-mail address in capital letters and then enters the same e-mail address in lowercase letters? You can ensure uniqueness by taking advantage of one of my favorite PostgreSQL features, a functional index:
CREATE UNIQUE INDEX people_email_idx ON ↪People(lower(email_address));
Now PostgreSQL will ensure its uniqueness, regardless of the case. This feature also comes in handy if you're trying to index a column containing long text strings. You can index only the first 1,000 characters:
CREATE UNIQUE INDEX people_email_idx ON ↪People(substring(email_address, 1, 1000));
Finally, another powerful feature is CTEs (common table expressions). If you often find yourself performing a query, putting the results in a temporary table and then querying that temp table, CTEs almost certainly are something you'll want to use. You basically describe the temporary table query, give it a name using the “with” statement, and then query that pseudo table. For example:
WITH Mini_users AS (SELECT id * 2 AS id2, email FROM Users) SELECT id2 FROM Mini_users;
In the example above, I create a new temporary table named mini_users and define it, and then query it as if it already existed. I already have found CTEs to be quite useful in simplifying complex queries. PostgreSQL also makes it easy to define recursive CTEs, allowing you to work with hierarchical data with a single, simple query.
One of the features I have grown to love over the years is the ability to write my own server-side functions—analogous to “stored procedures” in other databases—in a variety of languages. Most of the time, I use the built-in Pl/PgSQL language, which marries SQL queries with a procedural language. But in cases when I need a bit more power, or want to use existing libraries, I can write functions in PL/Perl, PL/Python or a number of other languages. Regardless of what language I use, these functions integrate seamlessly into the database, working no differently from built-in functions and allowing me to process data inside the database, rather than in my application.
Over time, these server-side languages have become increasingly sophisticated, and although the Pl/PgSQL syntax is not the most modern or expressive, it is easy to learn and handles a large number of cases easily and quickly. I've seen a wide variety of uses for server-side functions, from encapsulating most application logic in such functions to handling specific data-manipulation routines that don't logically belong in the application layer.
One of my favorite uses for functions is in triggers—actions the database executes automatically when a certain action occurs. For example, I once worked on a project into which we wanted to insert a URL, but in which we also wanted to have a (separate) column containing the hostname for that URL. I wrote a function that used regular expressions to retrieve the hostname and then inserted the hostname automatically into the appropriate column. From the application's perspective, it was able to insert a URL but then retrieve either a URL or a hostname. Triggers can be used to perform all sorts of actions before or after an insert, delete or update query is performed on a table.
One of the many advances in the 9.0 release was the improvement of “window functions”, functions introduced in 8.4 that make it possible to create aggregations (such as sum and average) over a portion of other rows in the table, rather than over the entire table. Thus, you could calculate the difference between someone's salary and other salaries in that person's department, or rank the number of citations a department has in a bibliographic index, or find which was the longest-running action in each controller, within an MVC Web application. I've only started to work with windowing functions, but the combination of built-in functionality from 8.4, plus some additions to 9.0 that make it easy to peek at preceding and following rows, lead me to believe this is a particularly powerful feature.
The above features are nice improvements, but they're icing on the cake when it comes to reasons for a full-version upgrade to 9.0.
First, this version makes it fairly simple to upgrade. Earlier versions required dumping the database to disk, upgrading the server, and then restoring the database from its dumped form. This might not be a big problem for small databases that can afford to be off-line for a while, but it is a major issue for large sites that cannot afford such a long shutdown period.
The core PostgreSQL developers heard the criticism and have responded with pg_upgrade. Now, pg_upgrade still is considered to be somewhat experimental, and it hasn't been given the official seal of approval by the development team, so it is placed in the contrib directory, rather than in any official location. However, pg_upgrade, which allows you to upgrade your PostgreSQL server without a dump or restore, is certainly one of the major innovations and accomplishments of 9.0, and it has been greeted with great excitement by people who were too nervous or busy to upgrade previously.
Another major feature—and perhaps the most impressive accomplishment of this version—is what's known as “hot streaming replication”. This feature is actually the combination of two different features, which work together to make for an amazing new backup and high-availability system.
The basic idea is as follows. When you commit a transaction to PostgreSQL, it doesn't immediately update the tables on disk. Rather, it writes a record of the transaction to a separate “write-ahead log”, or WAL, describing the change that should be made to the database. Only after enough WALs have accumulated does PostgreSQL actually update the data.
Starting in version 8.4, you could copy the WALs from one PostgreSQL server to another, typically using a program like rsync. When the WALs appeared on the second server, they were read into that system. If the first server were ever to go down, the second server could be put into active mode, picking up from where the first server had left off.
Although this was better than nothing, it had at least two problems. Not only was transferring the WALs via “log shipping” far from an ideal mechanism, but the secondary server could not be used while it was receiving data from the primary server.
Both of these problems have gone away in PostgreSQL 9.0, providing a robust and easy-to-configure master-slave mechanism for high availability. WALs now can be transferred from the primary server to the secondary using the network by configuring the secondary as a special kind of client program. As each transaction occurs, the WAL can be sent over the network to one or more remote servers, allowing them to be synchronized almost immediately with the primary. The other improvement is that the secondary server can answer read-only queries even while it's receiving data from the primary.
Hot streaming replication, as the combination of these two features is known, doesn't cover all situations. For example, I recently was working with a company that has multiple PostgreSQL servers whose data it would like to put onto a secondary server. For such purposes, hot streaming replication won't work. Fortunately, there are other solutions, such as Slony, that might be a better fit for the company's needs. But, for many sites that want to ensure their servers won't go down, this solution is more than adequate.
Am I gushing? Yes, absolutely. There are very few pieces of software that give me such pleasure to use and on which I feel I can utterly depend, and PostgreSQL is a select member of that club. It keeps my data safe, offers me a huge number of features, allows me to extend the database in whatever ways I want and need, and it gives me excellent performance with a modest time investment. If you're thinking of switching to an open-source database, or if you have already been using one and want to try something powerful but still rather easy to use, you would do well to try PostgreSQL. Especially with the latest features added in version 9.0, I'm sure you'll find something to like.
The home page for PostgreSQL is www.postgresql.org. From that site, you can download the software, subscribe to a number of (heavily trafficked and extremely friendly) e-mail lists and read through the documentation. There is also a #postgresql IRC channel on Freenode that is generally populated and available for answering questions.
One of the best-known commercial PostgreSQL support and product companies, EnterpriseDB, provides binary packages for a number of operating systems at its Web site, enterprisedb.com. It also offers a number of video and audio podcasts and screencasts, including those by PostgreSQL core contributor Bruce Momjian, which are of high quality.
Finally, two excellent books about PostgreSQL recently were published by Packt Press, filling in a long gap since the excellent (but outdated) PostgreSQL, 2nd edition was written by Korry Douglas a number of years ago. The two new books are PostgreSQL 9 Administration Cookbook by Simon Riggs and Hannu Krosing, and PostgreSQL 9.0 High Performance by Gregory Smith. I have learned a great deal from both books, and although the latter book spends a bit more time than I might have liked on hardware, it also taught me that even a software guy like myself occasionally needs to consider the physical side of software and database design. If you use PostgreSQL on a regular basis, I strongly recommend these books.
Reuven M. Lerner is a longtime Web developer, architect and trainer. He is a PhD candidate in learning sciences at Northwestern University, researching the design and analysis of collaborative on-line communities. Reuven lives with his wife and three children in Modi'in, Israel.