At the Forge - PostgreSQL 9.0
Until just a few years ago, anyone using the term database almost always was referring to a relational database—that is, one based on linked two-dimensional tables, which could be queried using the relatively standard SQL language. Sure, there were a variety of other types of databases, but they weren't in widespread use or taken very seriously.
In the past few years, however, there has been tremendous growth in the use and mindshare of so-called NoSQL databases. There's not much in common between these various databases—from Redis to MongoDB to CouchDB to Hadoop—that offer the promise of greater scalability and flexibility.
Now, I'm not one to say these non-relational databases are bad. On the contrary, I've used several of them in my own projects and found them to be a perfect fit for certain needs. But for day-to-day use, I continue to find relational databases an excellent fit. And when I need a relational database, I always prefer to use PostgreSQL. Indeed, when I'm working on a Web application, PostgreSQL is the default option. Only if I see that it won't work well (for whatever reason) do I switch some or all of the project to use a different database.
Why do I love PostgreSQL so much? First, because it's rock solid. The developers have an almost obsessive dedication to the safety of the data you store in the database, and they do their best to ensure that data won't ever be deleted or corrupted. This doesn't mean such situations never happen, but they're quite rare. PostgreSQL not only tries to ensure the safety of your data, but it also provides you with a variety of tools to help you constrain the values that may be stored in the database, ensuring that anything stored is guaranteed to be valid.
Second, PostgreSQL offers a wealth of features that never cease to impress and amaze me. Whether it's the number of languages in which you can write server-side functions or the different ways you can define indexes, the clever use of MVCC (multi-version concurrency control) to avoid writers blocking readers or the ongoing statistical analysis that feeds into the query optimizer, PostgreSQL hasn't failed me in more than 15 years of day-to-day use.
Every time I use another relational database, I find myself wishing for one or more functions that PostgreSQL provides or thinking that PostgreSQL wouldn't even enter into certain problematic situations, because of its high-quality architecture. This doesn't mean PostgreSQL is perfect, but its mix of features has served me quite well, and I'm not the only one to feel this way.
This month, I want to look at PostgreSQL 9.0, which was released in late 2010. In particular, I want to consider what advantages it offers over other open-source databases (mainly MySQL). For people already using PostgreSQL, what new features does the 9.0 release bring to the table?
I won't describe installing PostgreSQL, simply because versions are available for all major Linux distributions. Downloading and compiling the source code is relatively straightforward—although doing so as the dedicated “postgres” user, rather than as root, will save you many headaches during the installation process—but I have found binaries to be just fine for my day-to-day needs.
PostgreSQL is an open-source (BSD-licensed) relational database developed by a core group of developers working for a number of different companies. (Indeed, the rules of the core development group forbid more than a handful of core developers from working for the same company to ensure that no one organization controls development directly.) It grew out of the Postgres Project developed by Michael Stonebreaker at the University of California, Berkeley. Postgres itself was an attempt to improve upon the original Ingres database Stonebreaker had developed and commercialized.
Postgres used its own query language, known as QUEL, and had a number of advanced ideas, including many taken from the world of object-oriented programming. Several developers joined forces to take the Postgres database, replace QUEL with SQL, stabilize the code and release it under an open-source license. The first release of PostgreSQL, under its new name, came in 1996. Informally, a large number of PostgreSQL users and developers continue to call it Postgres, although that technically refers to Stonebreaker's project at Berkeley and not the current implementation.
One of the main goals of the PostgreSQL team has been to adhere to the SQL standard wherever possible. In addition, as I mentioned previously, PostgreSQL developers pride themselves on a rock-solid implementation, using the standard ACID (atomicity-consistency-isolation-durability) paradigm for database storage under all circumstances. They further try to balance between a powerful set of features and configurable options and overwhelming people who don't want to be full-time database administrators.
All PostgreSQL storage is transactional, using a system known as MVCC (multi-version concurrency control). MVCC, which also is used in Oracle and CouchDB, largely avoids conflicts between readers and writers, ensuring that neither rows nor tables need to be locked under most circumstances. MVCC does have the side effect of keeping “dead rows” around in the database, which occasionally need to be “vacuumed”, similar to garbage collection in many programming languages. For many years, vacuuming needed to be done manually, typically by running the vacuum command from a cron job. Nowadays, the autovacuum dæmon runs in the background, marking space as reusable when added or updated rows reach a configurable threshold.
Vacuum also can be run in “analyze” mode, in which case it examines tables and indexes, updating the statistics that are used when planning queries. This is one of the reasons PostgreSQL is able to operate so efficiently, even when handed complex queries. By keeping an up-to-date description of data stored in each table, it can make an intelligent decision regarding how to optimize the query. If the guesses turn out to be wrong for your particular needs, you can configure some configuration parameters.
PostgreSQL offers strong support for foreign keys, making it possible to ensure that connections between tables exist and work in the right way. That is, you can define a People table and then an Addresses table that refers to it:
CREATE TABLE People ( id SERIAL, first_name TEXT, last_name TEXT, email_address TEXT, PRIMARY KEY(id) ); CREATE TABLE Addresses ( id SERIAL, person_id INTEGER REFERENCES People, address TEXT, PRIMARY KEY(id) );
Now, let's try to insert a new record into the Addresses table, without inserting anything into People first:
INSERT INTO Addresses (person_id) VALUES (5);
This results in the following:
ERROR: insert or update on table "addresses" violates foreign key ↪constraint "addresses_person_id_fkey" DETAIL: Key (person_id)=(5) is not present in table "people".
Because I have defined person_id as a foreign key, PostgreSQL won't let me have an invalid value there, period. PostgreSQL also will refuse to let you remove a record from the People table if there are rows in the Addresses table referring to it. You can override such restrictions with the CASCADE keyword, but the database never will remove things on its own without your explicit approval.
You can, of course, also ensure that e-mail addresses are unique in your People table:
CREATE UNIQUE INDEX people_email_idx ON People(email_address);
But wait, what if someone enters an e-mail address in capital letters and then enters the same e-mail address in lowercase letters? You can ensure uniqueness by taking advantage of one of my favorite PostgreSQL features, a functional index:
CREATE UNIQUE INDEX people_email_idx ON ↪People(lower(email_address));
Now PostgreSQL will ensure its uniqueness, regardless of the case. This feature also comes in handy if you're trying to index a column containing long text strings. You can index only the first 1,000 characters:
CREATE UNIQUE INDEX people_email_idx ON ↪People(substring(email_address, 1, 1000));
Finally, another powerful feature is CTEs (common table expressions). If you often find yourself performing a query, putting the results in a temporary table and then querying that temp table, CTEs almost certainly are something you'll want to use. You basically describe the temporary table query, give it a name using the “with” statement, and then query that pseudo table. For example:
WITH Mini_users AS (SELECT id * 2 AS id2, email FROM Users) SELECT id2 FROM Mini_users;
In the example above, I create a new temporary table named mini_users and define it, and then query it as if it already existed. I already have found CTEs to be quite useful in simplifying complex queries. PostgreSQL also makes it easy to define recursive CTEs, allowing you to work with hierarchical data with a single, simple query.
Practical books for the most technical people on the planet. Newly available books include:
- Agile Product Development by Ted Schmidt
- Improve Business Processes with an Enterprise Job Scheduler by Mike Diehl
- Finding Your Way: Mapping Your Network to Improve Manageability by Bill Childers
- DIY Commerce Site by Reven Lerner
Plus many more.
- Building a Multisourced Infrastructure Using OpenVPN
- Handheld Emulation: Achievement Unlocked!
- Unikernels, Docker, and Why You Should Care
- Download "Linux Management with Red Hat Satellite: Measuring Business Impact and ROI"
- Happy GPL Birthday VLC!
- New Products
- diff -u: What's New in Kernel Development
- Controversy at the Linux Foundation
- Giving Silos Their Due
- Non-Linux FOSS: Snk