Open-Source Databases, Part II: PostgreSQL

Feature-rich PostgreSQL delivers on database integrity.

There are some problems with our table definition. Although we have effectively stopped people from storing NULL values in our TEXT columns, we haven't done anything to stop them from entering empty strings. In addition, we might want to ensure that the email_address column looks at least something like an e-mail address.

We can do this by adding constraints to our columns—tiny functions that check the value being inserted or updated. If the new value doesn't fit the definition, PostgreSQL refuses to allow its insertion. Here's a new definition of our table, with some constraints defined:

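CREATE TABLE People (
  id             SERIAL     NOT NULL,
  first_name     TEXT       NOT NULL CHECK (first_name <> ''),
  last_name      TEXT       NOT NULL CHECK (last_name <> ''),
  email_address  TEXT       NOT NULL CHECK (email_address ~* '.@.+\\.'),
  added_at       TIMESTAMP  NOT NULL DEFAULT NOW(),

  PRIMARY KEY(id),
  UNIQUE(email_address)
);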

If we inspect our table definition, it has changed somewhat, to include the constraints:

linux=# \d people
                  Table "public.people"
    Column    |            Type             |      Modifiers
--------------+-----------------------------+------------------------
id            | integer                     | not null default
first_name    | text                        | not null
last_name     | text                        | not null
email_address | text                        | not null
added_at      | timestamp without time zone | not null default now()
Indexes:
  "people_pkey" PRIMARY KEY, btree (id)
  "people_email_address_key" UNIQUE, btree (email_address)
Check constraints:
  "people_email_address_check" CHECK (email_address ~* '.@.+\\.'::text)
  "people_first_name_check" CHECK (first_name <> ''::text)
  "people_last_name_check" CHECK (last_name <> ''::text)

Let's see what happens if we violate these constraints:

linux=# insert into people (first_name , last_name, email_address)
    values ('', 'Lerner', '');
ERROR:  new row for relation "people" violates check constraint

linux=# insert into people (first_name , last_name, email_address)
    values ('Reuven2', 'Lerner2', 'reuven');
ERROR:  new row for relation "people" violates check constraint

Sure enough, our constraints help to ensure that our database is in order.
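
To see what the email_address pattern accepts, we can try the ~* operator directly in a SELECT. This is only a sketch: it writes the pattern as an E'' escape string, which is an assumption on our part rather than something the table definition uses. The E'' form makes the backslash handling explicit, because the way plain quoted strings treat backslashes depends on the server's standard_conforming_strings setting. The addresses are made-up test values.

```sql
-- No @ sign, so the pattern does not match.
SELECT 'reuven' ~* E'.@.+\\.' AS looks_like_email;              -- f

-- One character, an @, one or more characters, then a literal dot.
SELECT 'reuven@example.com' ~* E'.@.+\\.' AS looks_like_email;  -- t
```

The pattern is deliberately loose. It is not anchored, so it only checks that something resembling "x@y." appears somewhere in the value.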

One of the most commonly used constraints is the foreign key, in which a column in one table refers to a row in another. For example:

CREATE TABLE Appointments (
    person_id    INTEGER     NOT NULL  REFERENCES People,
    starting_time    TIMESTAMP NOT NULL,
    duration           INTERVAL NOT NULL,
    notes TEXT NULL
);

If we try to create an appointment that refers to a non-existent person, we will be rejected:

INSERT INTO Appointments (person_id, starting_time, duration, notes)
    VALUES (5000, '2007-Feb-12 13:00', interval '1 hour', 'Lunch');

ERROR: insert or update on table "appointments" violates foreign key
constraint "appointments_person_id_fkey"
DETAIL: Key (person_id)=(5000) is not present in table "people".

Foreign-key constraints help in the other direction as well. If you try to delete a row to which a foreign key points, PostgreSQL will refuse the request, indicating that other rows still refer to it; you must first delete or update those referencing rows. You can adjust this behavior by adding ON UPDATE or ON DELETE clauses to the foreign-key definition.
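
As a rough illustration of those modifiers, the following sketch replaces the foreign key on Appointments with one that cascades deletions. The constraint name is the one PostgreSQL reported in the error message above; the id value in the DELETE is an arbitrary example.

```sql
-- Swap the existing foreign key for one that cascades deletes.
ALTER TABLE Appointments
    DROP CONSTRAINT appointments_person_id_fkey,
    ADD CONSTRAINT appointments_person_id_fkey
        FOREIGN KEY (person_id) REFERENCES People (id)
        ON DELETE CASCADE;

-- Deleting a person now also removes that person's appointments,
-- instead of being refused. (id 1 is a hypothetical value.)
DELETE FROM People WHERE id = 1;
```

Whether cascading is appropriate depends on the application; ON DELETE SET NULL and the default refusal are alternatives.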

Other Features

This list of features is just the tip of the iceberg. And that's part of the magic of PostgreSQL—out of the box, it's straightforward and easy to use, but you almost always can redefine and extend existing functionality with your own code and data. The built-in operators, along with the flexible ways in which they can be combined and further enhanced with your own functions and definitions, make for a powerful combination. I don't often use unions or intersections, but I do often use views.

For example, one of my favorite features is the ability to use subselects just about anywhere you would have a value. If you have someone's e-mail address, you can use that to INSERT a row into Appointments in a single command:

INSERT INTO Appointments (person_id, starting_time, duration, notes)
    VALUES ((SELECT id
               FROM People
              WHERE email_address = ''),
            '2007-Feb-12 13:00', interval '1 hour', 'Lunch');

If the existing data types aren't enough, we can construct our own. PostgreSQL already comes with a number of existing data types, including geometric shapes, IP addresses and even ISBNs.
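
Here is a small sketch of both ideas. The first query uses the built-in inet type; the rest defines a domain, which is one lightweight way (among several) to build a type of our own. The names nonempty_text and Nicknames are hypothetical and exist only for this illustration.

```sql
-- Built-in inet type: is this address contained in the given network?
SELECT '192.168.1.10'::inet << '192.168.1.0/24'::inet AS in_subnet;

-- A domain is an existing type plus constraints, reusable across tables.
CREATE DOMAIN nonempty_text AS TEXT
    CHECK (VALUE <> '');

-- Hypothetical table using the new domain for one of its columns.
CREATE TABLE Nicknames (
    person_id  INTEGER        NOT NULL REFERENCES People,
    nickname   nonempty_text  NOT NULL
);
```

With the domain in place, the empty-string check no longer has to be repeated on every column that needs it.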

If we want to create more than one table with similar characteristics, we can take advantage of PostgreSQL's object-oriented features, namely table inheritance. Thus, we could have a People table and a Managers table, in which the definition of Managers inherits the characteristics of People and adds its own extensions.
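
A minimal sketch of such a pair of tables might look like this. The department column is an invented extension, used only to show where the child table's additional columns go.

```sql
-- Managers gets every column of People, plus its own department column.
CREATE TABLE Managers (
    department  TEXT  NOT NULL
) INHERITS (People);

-- Queries on the parent include rows stored in the child table...
SELECT first_name, last_name FROM People;

-- ...unless ONLY restricts the query to rows stored in People itself.
SELECT first_name, last_name FROM ONLY People;
```

One caveat worth keeping in mind: CHECK and NOT NULL constraints are inherited, but PRIMARY KEY, UNIQUE and foreign-key constraints are not, so they must be declared again on the child table if they are needed there.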

You also can create your own server-side functions, in a variety of different languages, from PostgreSQL's own PL/pgSQL to specialized versions of Perl, Python, Tcl, Java, Ruby and the R statistical language. These functions can return individual values or entire tables, and can be used in triggers. You also can use these functions to rewrite the rules for inserting, updating and deleting data from a table or even a view.
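
As a sketch of what a simple PL/pgSQL function looks like, here is one that returns a person's full name from the People table. The function name full_name and its parameter p_id are hypothetical, and depending on the PostgreSQL version, the plpgsql language may first have to be enabled in the database.

```sql
CREATE FUNCTION full_name(p_id INTEGER) RETURNS TEXT AS $$
DECLARE
    result TEXT;
BEGIN
    -- Look up the person and join the two name columns with a space.
    SELECT first_name || ' ' || last_name
      INTO result
      FROM People
     WHERE id = p_id;

    -- result is NULL if no row has this id.
    RETURN result;
END;
$$ LANGUAGE plpgsql;

-- Call it like any built-in function (id 1 is an arbitrary example).
SELECT full_name(1);
```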

But, perhaps the most important feature of all is the built-in support for transactions. Transactions are an essential part of database programming, in that they allow us to combine multiple queries into one all-or-nothing action. The classic example of a transaction is the movement of money from one bank account to another; if the power goes out, you want to be sure that the money was moved, or that it wasn't. It would be unacceptable for the money to disappear altogether or for it to appear in both accounts when the lights come back on.
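
The bank-transfer example can be sketched as follows. The Accounts table and its contents are hypothetical, invented only for this illustration.

```sql
-- Hypothetical table of accounts with a rule against overdrafts.
CREATE TABLE Accounts (
    id       INTEGER  PRIMARY KEY,
    balance  NUMERIC  NOT NULL CHECK (balance >= 0)
);

INSERT INTO Accounts (id, balance) VALUES (1, 500), (2, 100);

-- Move 200 from account 1 to account 2 as one all-or-nothing action.
BEGIN;
UPDATE Accounts SET balance = balance - 200 WHERE id = 1;
UPDATE Accounts SET balance = balance + 200 WHERE id = 2;
COMMIT;
```

If anything fails between BEGIN and COMMIT, or if we issue ROLLBACK instead, neither UPDATE takes effect.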

Recent versions of PostgreSQL have enhanced its transactional capabilities. Not only can you commit or roll back a transaction, but you also can define savepoints inside a transaction. If something goes wrong, you can either roll back the entire transaction or merely go to the previous savepoint. Moreover, PostgreSQL now supports two-phase commits, making it possible to synchronize distributed processes that require communication and coordination.
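
A savepoint might be used along these lines. This is a sketch that assumes, as in the earlier example, that no person with id 5000 exists; the name and e-mail address being inserted are made-up values.

```sql
BEGIN;

INSERT INTO People (first_name, last_name, email_address)
    VALUES ('Test', 'Person', 'test@example.com');

SAVEPOINT before_appointment;

-- This fails with a foreign-key violation, which puts the
-- transaction into an aborted state.
INSERT INTO Appointments (person_id, starting_time, duration, notes)
    VALUES (5000, '2007-Feb-12 13:00', interval '1 hour', 'Lunch');

-- Undo only the work done since the savepoint and continue.
ROLLBACK TO SAVEPOINT before_appointment;

-- The new row in People is kept.
COMMIT;
```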

If anything goes wrong, PostgreSQL's write-ahead log (WAL) ensures that even if the power is cut off at the most critical moment, each transaction will be either committed or rolled back, and that as many transactions as possible will be committed. The same log also makes PITR (point-in-time recovery) possible.

You might have noticed that I haven't mentioned locking at all. That's because, for the most part, PostgreSQL users don't have to worry about locking. Concurrent access is handled using a system known as MVCC (multiversion concurrency control), in which readers and writers do not block one another. Its main drawback is the creation of many unused and cast-off database rows. The traditional way to handle this in PostgreSQL is to VACUUM the database regularly, removing old rows and clearing up space. Recent versions now include an autovacuum daemon, reducing or even eliminating the need to VACUUM manually on a regular basis.
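
For reference, manual vacuuming looks roughly like this. The commands are a sketch run against the tables from this article; note that VACUUM cannot be run inside a transaction block.

```sql
-- Mark the space held by dead row versions in one table as reusable.
VACUUM People;

-- Do the same, and also refresh the planner's statistics for the table.
VACUUM ANALYZE People;

-- Check whether the autovacuum daemon is enabled on this server.
SHOW autovacuum;
```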

Finally, recent versions of PostgreSQL include support for tablespaces. This means you can spread tables across different directories and filesystems, rather than keep everything under the directory defined by your installation. This can boost performance or reliability significantly, particularly on large databases.
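
Using a tablespace might look like the following sketch. The tablespace name, the directory and the Appointment_Archive table are all hypothetical. The directory must already exist, be empty and be owned by the operating-system user that runs PostgreSQL, and creating a tablespace requires superuser privileges.

```sql
-- Register a directory on another filesystem as a tablespace.
CREATE TABLESPACE fastdisk LOCATION '/mnt/fastdisk/pgdata';

-- Create a new (hypothetical) table directly in that tablespace.
CREATE TABLE Appointment_Archive (
    person_id      INTEGER    NOT NULL,
    starting_time  TIMESTAMP  NOT NULL,
    duration       INTERVAL   NOT NULL,
    notes          TEXT       NULL
) TABLESPACE fastdisk;

-- Or move an existing table; this rewrites the table's files.
ALTER TABLE Appointments SET TABLESPACE fastdisk;
```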