Open-Source Databases, Part II: PostgreSQL

Feature-rich PostgreSQL delivers on database integrity.

where <username> is a PostgreSQL user with sufficient access privileges on the server. The -U option lets us indicate the user name, and the -l option asks the server to list those databases that are available, such as:

List of databases
    Name   |    Owner  | Encoding
testserver | reuven    | SQL_ASCII
postgres   | postgres  | SQL_ASCII
template0  | postgres  | SQL_ASCII
template1  | postgres  | SQL_ASCII
(4 rows)

Like many database systems, PostgreSQL maintains its own list of users and groups separately from the underlying operating system. To create a new user, we use the createuser program, which comes with PostgreSQL:

createuser -U postgres reuven

This invokes createuser as postgres (which has permission to create other users) and then creates a new user named reuven. If we make this new user a superuser, reuven also will be allowed to create new databases and roles.

Now, we can create a new database in this cluster:

createdb -U reuven linux

Double-check that the database exists with a reuse of psql -l:

psql -U reuven -l

You might have noticed that we have not given a password any time we have invoked a command that required a user. The default setting for PostgreSQL makes the server available via a local socket (and thus unavailable over the network). Because only local users will be allowed access, we allow connections from any defined user name, even without a password.

We can change this behavior, as well as other security- and connection-related behaviors, in the pg_hba.conf file located at the top of the cluster directory. The file contains extensive documentation and explains how to set up the connection parameters.

Creating a Table

Let's connect to our database and see what happens:

$ psql -U reuven linux

Sure enough, we get the psql prompt, inviting us to enter a query. A list of psql comments is available by typing \? at the prompt. We also can get help with SQL syntax and commands with \h, as in:


Sure enough, let's start by creating a table:

    id    SERIAL    NOT NULL,
    first_name    TEXT    NOT NULL,
    last_name    TEXT    NOT NULL,
    email_address    TEXT NOT NULL,

    PRIMARY KEY(id),

This table has five columns and three different data types. The most common type is TEXT, which is the typical way to store textual data. PostgreSQL has full support for Unicode, and TEXT columns may contain very long strings. (Until version 8.0, PostgreSQL could store a limited amount of data in each row, so this might come as news to some people.) There is no built-in support for indexing TEXT columns, but an add-on module, called tsearch2, is relatively easy to install and provides such functionality for anyone who needs it.

The added_at column is of type TIMESTAMP. PostgreSQL provides a very sophisticated set of time- and date-handling routines, thanks to the division between two basic data types, known as TIMESTAMP and INTERVAL. A timestamp indicates when something happened, whereas an interval tells you how long something took. We thus can subtract two timestamps (to get an interval) or add an interval to a timestamp (to get a new timestamp). Moreover, PostgreSQL lets us define intervals using English-like syntax, such as:

SELECT id FROM People WHERE added_at > NOW ()
 ↪- interval '30 days';

The above query shows us all of the people who were added to our contact database in the last 30 days.

The added_at column defines a DEFAULT value as well. This means that if we fail to INSERT an explicit value for added_at, PostgreSQL will use the current time (at the time of insertion).

The SERIAL Data Type

The id column, which we define to be our primary key, uses a SERIAL data type. The thing is, SERIAL isn't a data type at all. Rather, it's syntactic sugar that does several things:

  • It creates a new sequence object, whose values are integers that start at 1 and increase each time we ask for a value.

  • It defines the column type to be INTEGER.

  • It sets the DEFAULT clause for our column to be the result of requesting a new value from the sequence.

This might sound like a complicated way of saying, “SERIAL gives us an auto-incrementing column”. And that's true, in a sense, but you can have as many SERIAL columns as you like in a table, and each sequence can have all sorts of properties associated with it, including its starting point and increment.

Finally, by defining id to be a primary key and email_address to be unique, we implicitly ask PostgreSQL to create indexes on these two columns. When we execute the above query, PostgreSQL notifies us what it's doing behind the scenes:

NOTICE:  CREATE TABLE will create implicit sequence
"people_id_seq" for serial column ""
implicit index "people_pkey" for table "people"
NOTICE:  CREATE TABLE / UNIQUE will create implicit
index "people_email_address_key" for table "people"

If and when you drop the People table, these implicitly defined objects are dropped automatically.

To list all of the tables, sequences and views in our database, we can use the \d command:

linux=# \d
     List of relations
 Schema |     Name      |   Type   | Owner
 public | people        | table    | reuven
 public | people_id_seq | sequence | reuven
(2 rows)

We can add a letter to \d to get a list of only tables (t), indexes (i), sequences (s), functions (f) or views (v). For example, here is a list of the indexes that we have created:

linux=# \di
          List of relations
Schema |           Name           | Type  | Owner  | Table
public | people_email_address_key | index | reuven | people
public | people_pkey              | index | reuven | people
(2 rows)

We can also use \d to inspect a particular object more closely. For example, we can look at our People table with \d People:

linux=# \d People
                   Table "public.people"
Column        |      Type                   |  Modifiers
id            | integer                     | not null default
first_name    | text                        | not null
last_name     | text                        | not null
email_address | text                        | not null
added_at      | timestamp without time zone | not null default now()
"people_pkey" PRIMARY KEY, btree (id)
"people_email_address_key" UNIQUE, btree (email_address)

There are several things to notice in the above output:

  • First, PostgreSQL sees the table as public.people, not just people. This is because every object must exist inside of a schema, or namespace, and the default schema is called public. We can use schemas to partition the namespace within a particular database or to handle partitions. This means we don't need to split data across two more databases just to deal with conflicting permissions and names.

  • The table name, as well as all column names, are displayed in lowercase letters. That's because PostgreSQL tries to adhere to the SQL standard as best as possible, and the standard says that identifiers should be case-insensitive. If you really want case-sensitive names (and you probably don't), use double quotes around the identifiers.

  • Our id column has been transformed, as expected, into an integer column with a default value taken from a sequence.