Database Integrity and Web Applications
Want to improve the integrity of your data? Place constraints in the database, as well as in your application.
NoSQL, the catchall phrase for non-relational databases, is all the rage among Web developers. However, it's somewhat unfair and unhelpful to use the term NoSQL to describe them, given the variety of technologies involved. Even so, there are some fundamental differences between traditional relational databases and their NoSQL counterparts. For one, as the name implies, NoSQL databases don't use the standard SQL query language, and use either their own SQL-like language (for example, MongoDB) or an object-oriented API. Another difference is the lack of two-dimensional tables; whereas SQL databases operate solely with such tables, NoSQL databases eschew them in favor of name-value pairs or hash-like objects. And finally, NoSQL databases typically lack the features that led to the development of relational databases, namely transactions and data integrity.
There's no doubt that the flexibility NoSQL databases offer is attractive on a number of levels. Just as I enjoy working with dynamic languages in which I don't have to declare my variables (or their types) before I use them, it's nice to be able to store objects in my database without having to define the object structure in advance. If I want to add a new field to my Person object, I just do so, and the database magically catches up.
At the same time, there are many cases when I want the database to be tough with me and enforce the integrity of my data. That is, I want to be sure that even if I have made a mistake in my application, or if a user enters a value that shouldn't be allowed, the database won't allow that bad data to be stored. And yes, I believe that it's good to have such checks at the database level, rather than just at the application level—not only because it provides an additional guard against corrupt data, but also because the database often is accessed directly, outside the application itself. I often have to do what I refer to as "database surgery" on applications that are running for my clients, and it's always reassuring to know I cannot make a change manually that would corrupt the data.
Readers of my column know I'm a fan of both Ruby on Rails and of PostgreSQL, and I often use them together on projects. However, because Rails originally was developed under MySQL, which lacks many of the data-integrity aspects of PostgreSQL, the standard Active Record package fails to include such items as foreign keys in its base implementation. This means that although Rails will support PostgreSQL out of the box, it doesn't provide support for foreign keys, let alone data-integrity checks.
So, in this article, I look at some Ruby gems that make it possible to include foreign-key support in your Active Record models. I also describe ways you can take advantage of PostgreSQL's CHECK mechanism to ensure that your data is as safe as possible.
Where Do You Check Your Data?
Before I go into the actual code and implementation, I admit there are several philosophies of where and how to check your data. As I mentioned previously, I prefer to have the data checked at both the database and application levels, although sometimes I've been lazy and used the default in Rails that does so only at the application level.
Checking your data only at the application level is attractive in many ways. It makes the implementation easier, and it often will work just fine. It also, in the case of Rails, allows you to work in a single language (that is, Ruby) and put the integrity checks in your model files, where you're going to be using them.
But as I already said, if I have my checks only in the application, it's quite possible that when I try to access the database from outside my application, I'll end up messing things up, if only accidentally. Now, I'll admit this is the way many Web applications are designed, and it's not a fatal flaw. But it is something that I'll try to avoid in this article.
It would be a bigger mistake to try to do things the other way around—that is, to have the checks only in the database, rather than at the application level. Imagine, for example, if you were to have a "NOT NULL" constraint in the database, such that a particular column could not contain a NULL value. If you didn't protect against such things at the application layer, you might end up trying to put that data into your database. And although the database itself would not be corrupted, the application would generate an error, confusing and frustrating users.
With this background, let's now assume that I want to implement a simple appointment calendar. In order to implement my calendar, I need to keep track of people (each of whom has a first name, a last name, an e-mail address and a telephone number) and appointments (which will indicate the date/time of the appointment, the person with whom I'm meeting and a note about the meeting itself).
At the database level, the tables will be fairly straightforward. This is how I would create the tables if I were doing so manually:
CREATE TABLE People ( id SERIAL, first_name TEXT NOT NULL, last_name TEXT NOT NULL, email TEXT NOT NULL, phone TEXT NOT NULL, PRIMARY KEY(id) ); CREATE TABLE Appointments ( id SERIAL, person_id INTEGER REFERENCES People NOT NULL, notes TEXT NOT NULL, PRIMARY KEY(id) );
The important thing here, for the purposes of this discussion, is the
REFERENCES People clause in the definition of the Appointments
table. With the REFERENCES clause in place, I can do the following:
INSERT INTO People (first_name, last_name, email, phone) VALUES ('Reuven', 'Lerner', 'firstname.lastname@example.org', ↪'847-230-9795'); INSERT INTO Appointments (person_id, notes) VALUES (1, 'Meet with ↪myself');
Now, here's what happens if I try to delete myself from the People table:
atf=# DELETE FROM People WHERE id = 1; ERROR: update or delete on table "people" violates foreign key constraint "appointments_person_id_fkey" on table "appointments" DETAIL: Key (id)=(1) is still referenced from table "appointments".
Because Appointments.person_id is a foreign key to People.id, I cannot remove a row from People if Appointments refers to it. This sort of relational integrity is a great thing for my application. No matter how you slice it, it means I cannot remove people from the system if they are scheduled for an appointment.
If these tables were part of a Web application that I was developing in Ruby on Rails, I would use database migrations to create them. The migrations would look something like this (if combined into a single file):
class CreatePeopleAndAppointments < ActiveRecord::Migration def change create_table :people do |t| t.string :first_name, :null => false t.string :last_name, :null => false t.string :email, :null => false t.string :phone, :null => false t.timestamps end create_table :appointments do |t| t.integer :person_id, :null => false t.text :notes, :null => false t.timestamps end end end
Now, this will give me the table definitions, but it won't give me the foreign keys, meaning that I could corrupt the database by deleting a single "People" record. In order to do that, at least in the default Rails configuration, I need to use an "execute" command in the migration to send explicit SQL to the database.
Fortunately, there is an easier way. The Foreigner gem, written by
Matthew Higgins, which works with MySQL as well as with PostgreSQL,
adds syntax to let you create and remove foreign keys within your
migrations. For example, with Foreigner active—putting it in the
Gemfile and then running
bundle install—I can add a new migration
that does the following:
class AddForeignKey < ActiveRecord::Migration def up add_foreign_key :appointments, :people end def down remove_foreign_key :appointments, :people end end
Sure enough, after running this migration, I have the foreign key that I was hoping for. Again, it's still important for me to have this check in my Rails model; otherwise, I easily could get myself into a situation that the database forbids but the model allows and, thus, generate a bad error for my users.
Note: if you add the foreign key after you already have populated the database with some data, you might run into trouble. That's because PostgreSQL won't let you add a foreign key if one or more rows fail to abide by the declared constraint. The NOT NULL constraint additionally ensures that you point to some person in the system. In other words, every appointment has to be with someone, and it has to be with someone in the People table.
Now, let's say you already have been working on a project, but you have neglected to define foreign keys. You could go through each table, figure out which is pointing to which, and then handle it accordingly, adding the sorts of migrations shown above. But the Immigrant gem (also written and released by Matthew Higgins) looks at Rails models and adds foreign keys for every has_one and has_many column that points to a belongs_to column in another model.
Foreigner is a great step forward, helping ensure the integrity of your data. But there is another issue, more subtle than that of foreign keys, to which I'm still susceptible. While I have ensured that the person_id will not be NULL and will point to a record in the People table, I haven't ensured that the People table will contain valid and reasonable values. That is, although the first_name, last_name, e-mail and phone columns will not contain NULLs, they might contain empty strings or values (for example, e-mail addresses) that are invalid.
At the database level, I could handle this sort of problem with a CHECK clause. Such a clause ensures that illegal data—for whatever "illegal" values I want to define—cannot be placed in the database. This could be anything from a certain minimum or maximum length of text string, to a pattern in the text, to minimum or maximum numbers. For example, I often like to indicate that the database may not contain a price lower than 0, and that e-mail addresses need to match a very basic regular expression. (Note that matching e-mail addresses is not for the faint of heart, at least if you want to do it correctly.)
So given my People table, I could define a set of CHECK clauses that would ensure that the first_name field is non-empty. In other words, the first_name cannot be NULL, but it also cannot be the empty string:
ALTER TABLE People ADD CONSTRAINT people_first_name_non_empty_chks CHECK (first_name <> '');
Note that although I could add a number of checks within this single constraint, I prefer not to do so. That gives me greater flexibility to add and remove constraints, and it also ensures that when a constraint is violated, PostgreSQL will accurately tell me which one it was.
Now, how can I implement this in my Rails migration, and do I want to? My answer, as you can imagine, is that this would indeed be a good thing to include in the database.
Once again, a Ruby gem comes to the rescue. This one, sexy_pg_constraints, was written by Maxim Chernyak, but it has since forked and is maintained by several other people.
I can include it in my Rails application by adding the following to my Gemfile:
gem 'Empact-sexy_pg_constraints', :require => 'sexy_pg_constraints'
and by uncommenting the following line in config/application.rb:
config.active_record.schema_format = :sql
Simply put, sexy_pg_constraints adds a number of additional attributes that you can pass to an add_column or change_column invocation within your migration. For example, let's say I want to make sure, as before, that the first_name column is never blank—neither NULL nor the empty string. I can do this by saying:
class AddConstraints < ActiveRecord::Migration def up constrain :people, :first_name, :not_blank => true end def down deconstrain :people, :first_name end end
After I apply this migration, I discover the following in my table definition:
"people_first_name_not_blank" CHECK (length(btrim(first_name::text)) > 0)
In other words, I am checking to ensure that after removing all whitespace from either side of the string, the length is greater than 0. Sounds like it worked to me!
sexy_pg_constraints comes with a large number of options, including whitelisting, blacklisting, matching e-mail addresses and checking the format of data. The only thing this gem is missing, by my estimate, is a way to get the model file to communicate automatically with the database and/or the migrations, so that you don't have to add these in both places manually. Even so, by adding these constraints, you improve the integrity of your data without having to go too far out of the Rails migration framework.
Database constraints are there to save people from themselves, and they are a great feature offered by relational databases. In exchange for a bit of work up front, and for some small performance penalties during runtime, you can ensure that your data remains intact. I would argue that doing this sort of validation is important at both the database and the application levels. A number of Ruby gems, as I explained here, make it possible to do this sort of integration within Ruby on Rails.
Information about PostgreSQL and constraints is on the Web at http://www.postgresql.org/docs/current/static/ddl-constraints.html, and this should be read by anyone interested in the subject.
A recent branch of the sexy_pg_constraints gem is at https://github.com/carbonfive/sexy_pg_constraints.
Database image via Shutterstock.com.