PostgreSQL 10: a Great New Version for a Great Database
Reuven reviews the latest and most interesting features in PostgreSQL 10.
PostgreSQL has long claimed to be the most advanced open-source relational database. For those of us who have been using it for a significant amount of time, there's no doubt that this is true; PostgreSQL has consistently demonstrated its ability to handle high loads and complex queries while providing a rich set of features and rock-solid stability.
But for all of the amazing functionality that PostgreSQL offers, there have long been gaps and holes. I've been in meetings with consulting clients who currently use Oracle or Microsoft SQL Server and are thinking about using PostgreSQL, who ask me about topics like partitioning or query parallelization. And for years, I've been forced to say to them, "Um, that's true. PostgreSQL's functionality in that area is still fairly weak."
So I was quite excited when PostgreSQL 10.0 was released in October 2017, bringing with it a slew of new features and enhancements. True, some of those features still aren't as complex or sophisticated as you might find in commercial databases. But they do demonstrate that over time, PostgreSQL is offering an amazing amount of functionality for any database, let alone an open-source project. And in almost every case, the current functionality is just the first part of a long-term roadmap that the developers will continue to follow.
In this article, I review some of the newest and most interesting features in PostgreSQL 10—not only what they can do for you now, but what you can expect to see from them in the future as well. If you haven't yet worked with PostgreSQL, I'm guessing you'll be impressed and amazed by what the latest version can do. Remember, all of this comes in an open-source package that is incredibly solid, often requires little or no administration, and which continues to exemplify not only high software quality, but also a high-quality open-source project and community.
PostgreSQL's transactions are handled using a system known as MultiVersion Concurrency Control (MVCC), which reduces the number of times the database must lock a row. This doesn't mean that deadlocks never happen, but they tend to be rare and are relatively easy to avoid. The key thing to understand in PostgreSQL's MVCC is that deleting a row doesn't actually delete it, but merely marks it as deleted by indicating that it should no longer be visible after a particular transaction. When all of the transaction IDs are greater than that number, the row's space can be reclaimed and/or reused, a process known as "vacuuming". This system also means that different transactions can see different versions of the same row at the same time, which reduces locks. MVCC can be a bit hard to understand, but it is part of PostgreSQL's success, allowing you to run many transactions in parallel without worrying about who is reading from or writing to what row.
The PostgreSQL project started more than 20 years ago, thanks to a merger between the "Postgres" database (created by Michael Stonebreaker, then a professor at Berkeley, and an expert and pioneer in the field of databases) and the SQL query language. The database tries to follow the SQL standard to a very large degree, and the documentation indicates where commands, functions and data types don't follow that standard.
For two decades, the PostgreSQL "global development group" has released a new version of the database roughly every year. The development process, as you would expect from an established open-source project, is both transparent and open to new contributors. That said, a database is a very complex piece of software, and one that cannot corrupt data or go down if it's going to continue to have users, so development tends to be evolutionary, rather than revolutionary. The developers do have a long-term roadmap, and they'll often roll out features incrementally across versions until they're complete. Beyond the core developers, PostgreSQL has a large and active community, and most of that community's communication takes place on email lists.
Open-source projects often avoid making a big deal out of a software release. After all, just about every release of every program fixes bugs, improves performance and adds features. What does it matter if it's called 3.5 or 2.8 or 10.0?
That said, the number of huge features in this version of PostgreSQL made it almost inevitable that it was going to be called 10.0, rather than 9.7 (following the previous version, 9.6). What is so deserving of this big, round number?
Two big and important features were the main reasons: logical replication and better table partitions. There were many other improvements, of course, but in this article, I focus on these big changes.
Before continuing, I should note that installing PostgreSQL 10 is quite easy, with ports for many operating systems—including various Linux distributions—readily available. Go to the main PostgreSQL site, and click on the link for "download". That will provide the instructions you need to add the PostgreSQL distribution to the appropriate package repository, from which you can then download and install it. If you're upgrading from a previous version, of course, you should be a bit more conservative, double-checking to make sure the data has been upgraded correctly.
I also should note that in the case of Ubuntu, which I'm running on my server, the number of packages available for PostgreSQL 10 is massive. It's normal to install only the base server and client packages, but there are additional ones for some esoteric data types, foreign data wrappers, testing your queries and even such things as an internal cron system, a query preprocessor and a number of replication options. You don't have to install all of them, and you probably won't want to do so, but the sheer number of packages demonstrates how complex and large PostgreSQL has become through the years, and also how much it does.
For years, PostgreSQL lacked a reasonable option for replication. The best you could do was take the "write-ahead logs", binary files that described transactions and provided part of PostgreSQL's legendary stability, and copy them to another server. Over time, this became a standard way to have a slave server, until several years ago when you could stream these write-ahead log (WAL) files to another server. Master-slave replication thus became a standard PostgreSQL feature, one used by many organizations around the world—both to distribute the load across multiple servers and to provide for a backup in the case of server failure. One machine (the master) would handle both read and write queries, while one or more other (slave) machines would handle read-only queries.
Although streaming WALs certainly worked, it was limited in a number of ways. It required that both master and slave use the same version of PostgreSQL, and that the entire server's contents be replicated on the slave. For reasons of performance, privacy, security and maintenance, those things deterred many places from using PostgreSQL's master-slave streaming.
So it was with great fanfare that "logical replication" was included in PostgreSQL 10. The idea behind logical replication is that a server can broadcast ("publish") the changes that are made not using binary files, but rather a protocol that describes changes in the publishing database. Moreover, details can be published about a subset of the database; it's not necessary to send absolutely everything from the master to every single slave.
In order to get this to work, the publishing server must create a
"publication". This describes what will be sent to subscribing servers. You
can use the new
CREATE PUBLICATION command to do this.
As I wrote above, replication of the WAL files meant that the entire database server (or "cluster", in PostgreSQL terminology) needed to be replicated. In the case of logical replication, the replication is done on a per-database basis. You then can decide to create a publication that serves all tables:
CREATE PUBLICATION mydbpub FOR ALL TABLES;
Note that when you say
FOR ALL TABLES, you're indicating that you want to
publish not only all of the tables that currently exist in this database, but
also tables that you will create in the future. PostgreSQL is smart enough to
add tables to the publication when they are created. However, the subscriber
won't know about them automatically (more on that to come).
If you want to restrict things, so that only a specific table is replicated, you can do so with this:
CREATE PUBLICATION MyPeoplePub FOR TABLE People;
You also can replicate more than one table:
CREATE PUBLICATION MyPeopleFooPub FOR TABLE People, Foo;
If you are publishing one or more specific tables, the tables must already exist at the time you create the publication.
The default is to publish all actions that take place on the published
tables. However, a publication can specify that it's going to publish
only inserts, updates and/or deletes. All of this is configurable when the
publication is created, and can be updated with the
If you're using the interactive "psql" shell, you can take a look at
current publications with
\dRp, which is short for "describe replication
publications". It's not the easiest command to remember, but they long ago
ran out of logical candidates for single-letter commands. This command will
show you which publications have been defined and also what permissions they
have (more on that in a moment). If you want to know which tables are
included in a publication, you can use
Once you've set up the publication, you can set up a subscription with (not
CREATE SUBSCRIPTION command. Here, things are a bit
trickier, because the data is actually arriving into the subscriber's
database, which means there might be conflicts or issues.
First and foremost, creating a subscription requires that you have a valid login (user name and password) on the publisher's system. With that in hand, you can say:
CREATE SUBSCRIPTION mysub CONNECTION 'host=mydb user=myuser' ↪PUBLICATION MyPeoplePub;
Notice that you use a standard PostgreSQL "connecting string" to connect to the server. You can use additional options if you want, including setting the port number and the connection timeout. Because a database might have multiple publications, you have to indicate the publication name to which you want to subscribe, as indicated here. Also note that the user indicated in this connection string must have "replication" privileges in the database.
Once the subscription has been created, the data will be replicated from its current state on the publisher.
I've already mentioned that using the
FOR ALL TABLES option with
PUBLISHER means that even if and when new tables are added, they will be
included as well. However, that's not quite true for the subscriber. On the
subscriber's side, you need to indicate that there have been changes in the
publisher and that you want to refresh your subscription:
ALTER SUBSCRIPTION testsub REFRESH PUBLICATION;
If you've done any binary replication in previous PostgreSQL versions, you already can see what an improvement this is. You don't have to worry about WALS, or about them being erased, or about getting the subscribing server up to speed and so forth.
Now, it's all well and good to talk about replication, but there's always the possibility that problems will arise. For example, what happens if the incoming data violates one or more constraints? Under such circumstances, the replication will stop.
There are also a number of caveats regarding what objects are actually replicated—for example, only tables are replicated, such objects as views and sequences are not.
Let's say you're using PostgreSQL to keep track of invoices. You might want to have an "invoices" table, which you can query by customer ID, date, price or other factors. That's fine, but what happens if your business becomes extremely popular, and you're suddenly handling not dozens of customers a month, but thousands or even millions? Keeping all of that invoicing data in a single database table is going to cause problems. Not only are many of the older invoices taking up space on your primary filesystem, but your queries against the table are going to take longer than necessary, because these older rows are being scanned.
A standard solution to this problem in the database world is partitioning. You divide the table into one or more sub-tables, known as "partitions". Each partition can exist on a different filesystem. You get the benefits of having a single table on a single database, but you also enjoy the benefits of working with smaller tables.
Unfortunately, such partitioning was available in previous versions of PostgreSQL—and although it worked, it was difficult to install, configure and maintain. PostgreSQL 10 added "declarative partitioning", allowing you to indicate that a table should be broken into separate partitions—meaning that when you insert data into a partitioned table, PostgreSQL looks for the appropriate partition and inserts it there.
PostgreSQL supports two types of partitioning schemes. In both cases, you have to indicate one or more columns on which the partitioning will be done. You can partition according to "range", in which case each partition will contain data from a range of values. A typical use case for this kind of partition would be dates, such as the invoices example above.
But, you also can partition over a "list" value, which means that you divide things according to values. For example, you might want to have a separate partition for each state in the US or perhaps just for different regions. Either way, the list will determine which partition receives the data.
For example, you can implement the date invoice example from above as follows. First, create an Invoices table:
postgres=# CREATE TABLE Invoices ( id SERIAL, issued_at TIMESTAMP NOT NULL, customer_name TEXT NOT NULL, amount INTEGER NOT NULL, product_bought TEXT NOT NULL ) partition by range (issued_at); CREATE TABLE
(And yes, in an actual invoice system, you would be using foreign keys to keep track of customers and products.)
Notice that at the conclusion of the
CREATE TABLE command, I've added a
"partition by range" statement, which indicates that partitions of this table
will work according to ranges on
issued_at, a timestamp.
But perhaps even more interesting is the fact that
SERIAL (that is,
sequence) value, is not defined as a primary key. That's because you cannot
have a primary key on a partitioned table; that would require checking a
constraint across the various partitions, which PostgreSQL cannot guarantee.
With the partitioned table in place, you now can create the individual partitions:
postgres=# CREATE TABLE issued_at_y2018m01 PARTITION OF Invoices FOR VALUES FROM ('2018-jan-01') to ('2018-jan-31'); CREATE TABLE postgres=# CREATE TABLE issued_at_y2018m02 PARTITION OF Invoices postgres-# FOR VALUES FROM ('2018-feb-01') to ('2018-feb-28'); CREATE TABLE
Notice that these partitions don't have any column definition. That's because
the columns are dictated by the partitioned table. In
psql, I can ask for a
description of the first partition. See Table 1 for an example of what this
would look like.
Partition of: invoices FOR VALUES FROM ('2018-01-01 00:00:00') ↪TO ('2018-01-31 00:00:00')
You can see from the example shown in Table 1 not only that the partition acts like a regular table, but also that it knows very well what its range of values is. See what happens if I now insert rows into the parent "invoices" table:
postgres=# insert into invoices (issued_at , customer_name, ↪amount, product_bought) postgres-# values ('2018-jan-15', 'Jane January', 100, 'Book'); INSERT 0 1 postgres=# insert into invoices (issued_at , customer_name, ↪amount, product_bought) values ('2018-jan-20', 'Jane January', 200, 'Another book'); INSERT 0 1 postgres=# insert into invoices (issued_at , customer_name, ↪amount, product_bought) values ('2018-feb-3', 'Fred February', 70, 'Fancy pen'); INSERT 0 1 postgres=# insert into invoices (issued_at , customer_name, ↪amount, product_bought) values ('2018-feb-15', 'Fred February', 60, 'Book'); INSERT 0 1
So far, so good. But, now how about a query on "invoices":
postgres=# select * from invoices; id | issued_at | customer_name | amount | product_bought ----+---------------------+---------------+--------+---------------- 3 | 2018-02-03 00:00:00 | Fred February | 70 | Fancy pen 4 | 2018-02-15 00:00:00 | Fred February | 60 | Book 1 | 2018-01-15 00:00:00 | Jane January | 100 | Book 2 | 2018-01-20 00:00:00 | Jane January | 200 | Another book (4 rows)
I also can , if I want, query one of the partitions directly:
postgres=# select * from issued_at_y2018m01 ; id | issued_at | customer_name | amount | product_bought ----+---------------------+---------------+--------+---------------- 1 | 2018-01-15 00:00:00 | Jane January | 100 | Book 2 | 2018-01-20 00:00:00 | Jane January | 200 | Another book (2 rows)
Although you don't have to do so, it's probably a good idea to set an index on the partition key on each of the individual partitions:
postgres=# create index on issued_at_y2018m01(issued_at); CREATE INDEX postgres=# create index on issued_at_y2018m02(issued_at); CREATE INDEX
That will help PostgreSQL find and update the appropriate partition.
Not everything is automatic or magical here; you'll have to add partitions, and you even can remove them when they're no longer needed. But this is so much easier than used to be the case, and it offers more flexibility as well. It's no surprise that this is one of the features most touted in PostgreSQL 10.
I've personally been using PostgreSQL for about 20 years—and for so many years people said, "Really? That's your preferred open-source database?" But, now a large and growing number of people are adopting and using PostgreSQL. It already was full of great features, but there's always room to improve—and with PostgreSQL 10, there are even more reasons to prefer it over the alternatives.
To learn more about PostgreSQL, download the code, read the documentation and sign up for the community e-mail lists, go to https://www.postgresql.org.
About the Author
Reuven Lerner teaches Python, data science and Git to companies around the world. His free, weekly "better developers" email list reaches thousands of developers each week; subscribe here. Reuven lives with his wife and children in Modi'in, Israel.
Limited Time Offer
Take Linux Journal for a test drive. Download our September issue for FREE.
Topic of the Week
The cloud has become synonymous with all things data storage. It additionally equates to the many web-centric services accessing that same back-end data storage, but the term also has evolved to mean so much more.