At the Forge - Shoehorning Data into a Database

HOWTOs

by Reuven M. Lerner

on September 1, 2008

Relational databases are really great for storing and retrieving data, but sometimes, they aren't quite up to the task. Joe Celko, whose SQL for Smarties books are among my favorites, dedicated an entire volume to the issue of trees and hierarchies. These data structures might be common and useful in most programming languages, but they can be difficult to model as tables, particularly if you care about efficient use of the database. Things become even trickier if you're dealing with a number of related, but distinct, types of entities, such as different types of employees or different types of vehicles.

One way to solve this problem is not to use relational databases. Objects can be quite good at handling trees and arrays, as well as inheritance hierarchies. Furthermore, object databases do exist, and the Python-based Zope application framework has demonstrated that it's even possible to have object databases in production. Gemstone's demonstration of Ruby running on top of its Smalltalk VM, with its accompanying object database, means that Ruby programmers soon might have access to similar technology.

But, object databases still are far from the mainstream. Most Web developers have access to a relational database, and not much else. Is there anything that we can do for these people?

This month, we take a look at two different ways we can handle data that doesn't quite fit into a relational database. These techniques are quite different from one another, and they don't even come close to the full range of possibilities you can get with a relational database. But, they both work and are used in production environments—and if your data doesn't seem to fit into standard database paradigms, you might want to consider one of them.

PostgreSQL's Table Inheritance

Some data-modeling issues are typically even harder to deal with. For example, a classic introduction to the world of object-oriented programming describes a human resources department. The HR department tracks employees, all of whom have some common characteristics. But, some employees are programmers, some are secretaries, and some are managers—and each of the employee types has specific data that needs to be associated with them.

In an object-oriented world, it's easy to model this. You create an employee class, and then create multiple subclasses of programmer, secretary and manager. Subclassing creates an “is-a” relationship, such that a programmer is an employee. This means that programmers have all the attributes of an employee, but also have some additional characteristics that distinguish them from an ordinary employee. With these subclasses in place, we then can create an array (or any other data structure) of people in our company, knowing that although some are programmers and others are secretaries, they're all employees and can be treated as such.

Translating this idea to the world of relational databases can be a bit tricky. One solution is to use inheritance in your database tables. PostgreSQL has done this for years; thus, it's called an object-relational database by many users. You can do the following in PostgreSQL, for example:


CREATE TABLE Employees (
    id            SERIAL,
    first_name    TEXT    NOT NULL,
    last_name     TEXT    NOT NULL,
    email_address TEXT    NOT NULL,

    PRIMARY KEY(id),
    UNIQUE(email_address)
);

CREATE TABLE Programmers (
    main_language    TEXT    NOT NULL
) INHERITS(Employees);

CREATE TABLE Secretaries (
    words_per_minute    INTEGER    NOT NULL
) INHERITS(Employees);


INSERT INTO Employees (first_name, last_name, email_address)
    VALUES ('George', 'Washington', 'georgie@whitehouse.gov');

INSERT INTO Programmers (first_name, last_name,
                         email_address, main_language)
    VALUES ('Linus', 'Torvalds', 'torvalds@osdl.org', 'C');

INSERT INTO Secretaries (first_name, last_name,
                         email_address, words_per_minute)
    VALUES ('Condoleezza', 'Rice', 'rice@state.gov', 10);

If we ask for all employees in the system, we'll get all three of the people we have entered:


atf=# select * from employees;
 id | first_name | last_name  |     email_address
----+------------+------------+------------------------
  1 | George     | Washington | georgie@whitehouse.gov
  2 | Linus      | Torvalds   | torvalds@osdl.org
  3 | Condoleezza| Rice       | rice@state.gov
(3 rows)

Of course, this query shows only the columns of the Employees table, which are common to that table and to those that inherit from it. If we want to find out how many words per minute someone types, we must address that query specifically to the Secretaries table:


atf=# select * from secretaries;
 id | first_name | last_name | email_address  | words_per_minute
----+------------+-----------+----------------+------------------
  3 | Condoleezza| Rice      | rice@state.gov |               10
(1 row)

Notice that the id column for all three tables, which was defined as SERIAL (that is, a nonrepeating incrementing integer), is unique across all three tables.

Polymorphic Associations

The way that PostgreSQL has integrated this type of object hierarchy into its relational system is impressive, flexible and useful. And yet, because it is unique to PostgreSQL, it means that no higher-level, database-agnostic application framework can support it. This especially is true in Ruby on Rails, which tries to treat all databases as similar or identical, going so far as to encourage programmers to use a Ruby-based domain-specific language (migrations) to create and modify database definitions. Using PostgreSQL's inheritance features might work, but it will take a fair amount of twisting to make it compatible with Rails.

Besides, Rails already has a feature, known as polymorphic associations, that lets us work with distinct types of items as if they were part of a single class. This isn't the same as an object hierarchy—we can't say that secretaries and programmers are both types of employees. But, we can say that secretaries and programmers are both employable and treat them as similar via that description.

To begin, you might remember that Rails has something known as associations, which allow us to connect one model to another. For example, let's say that each company has one or more employees. Thus, we can create some simple models. We can generate migrations with:

./script/generate model company name:string
./script/generate model employee first_name:string
    last_name:string email_address:string company_id:integer

Then, we can turn the automatically generated migration files into actual database tables with the following:

rake db:migrate

Now, we can indicate that each company can have one or more employees by modifying the model files. For example, we add the following to employee.rb:


class Company < ActiveRecord::Base
  has_many :employees
end

Similarly, we can say:


class Employee < ActiveRecord::Base
  belongs_to :company
end

With has_many and belongs_to in place, we now have created an “association” between these two models. This might not seem too exciting, but it means we can treat the two tables as object classes and each row in the table as an instance:


xyz = Company.create(:name => 'XYZ Corporation')

george = Employee.create(:first_name => 'George',
    :last_name => 'Washington',
    :email_address => 'georgie@whitehouse.gov',
    :company_id => xyz.id)

Now, we can say:

p xyz.employees.first

and we get back our george user. Similarly, we can say:

p george.company

and get back our xyz company. This is all standard stuff for Rails programmers, and it is part of the ActiveRecord feature known as associations. You can create all sorts of associations, giving them arbitrary names. For example, we could say:


class Company < ActiveRecord::Base
  has_many :employees
  has_many :employees_with_a, :class_name => 'Employee',
            :conditions => "first_name ilike '%a%'"
end

With this in place, and after restarting the console (or typing reload!), we now can say:

xyz = Company.find_by_name('XYZ Corporation')

xyz.employees_with_a

This prints the empty list—not surprising, given that we have defined only a single employee so far, and his name didn't contain an a. But, now we can create a second employee:


jane = Employee.create(:first_name => 'Jane',
                       :last_name => 'Austin',
                       :email_address => 'jane@bookauthor.com',
                       :company_id => xyz.id)

If we run our association again:

xyz.employees_with_a

now we get our jane employee.

This is all well and good, but what happens if we want to represent different types of employees, each of whom is employed by a company, but with different associated data? This is where polymorphic associations become useful. In order for this to work, we need to change the definitions of our models, as well as the relationships among them (if you're playing along at home, blow away the existing Employee and Company models before continuing):

./script/generate model company name:string
./script/generate model contract employable_id:integer
 employable_type:string company_id:integer
./script/generate model programmer main_language:string
 first_name:string last_name:string email_address:string
./script/generate model secretary words_per_minute:integer
 first_name:string last_name:string email_address:string

The above invocations of script/generate create four different models: one for a company, another for a programmer, another for a secretary and a fourth for a contract. Our PostgreSQL model allowed us to have a single Employee table and to have programmers and secretaries inherit from that table. Rails doesn't let us specify that one model inherits from another. Rather, we use Rails to describe the relationships among the models. Companies are connected to programmers and secretaries via employment contracts.

Because we are looking at the relationships among standalone models, rather than an inheritance hierarchy, there's no obviously good place in which to stick attributes that are common to programmers and secretaries. In the end, I decided to put the attributes in the programmer and secretary models, respectively, despite the repetition.

Now, let's define the associations:


class Company < ActiveRecord::Base
  has_many :contracts
end

class Contract < ActiveRecord::Base
  belongs_to :company
  belongs_to :employable, :polymorphic => true
end

class Programmer < ActiveRecord::Base
  has_many :contracts, :as => :employable
  has_many :companies, :through => :contracts
end

class Secretary < ActiveRecord::Base
  has_many :contracts, :as => :employable
  has_many :companies, :through => :contracts
end

In other words, each company has many contracts. Each contract joins together a company and someone who is employable. Who is employable? Right now, only programmers and secretaries fit the bill, connecting to the employable interface with contracts, and then to a company via a contract.

Behind the scenes, Rails is pulling a nasty trick, one that should make any good database programmer feel sick. The contract model includes two fields (employable_id and employable_type), which point to a single row in a particular table. In some ways, this is sort of a poor man's foreign key. But the difference is that the foreign key can point to any of several tables. And, of course, there is no error checking; only the application can stop me from entering a random text string in the employable_type column.

So, now we can create some relationships:


xyz = Company.create(:name => 'XYZ Corporation')

p1 = Programmer.create(:first_name => 'Linus',
                       :last_name => 'Torvalds',
                       :email_address => 'torvalds@osdl.org',
                       :main_language => 'C')

Contract.create(:employable => p1, :company => xyz)

s1 = Secretary.create(:first_name => 'Condoleezza',
                      :last_name => 'Rice',
                      :email_address => 'rice@state.gov',
                      :words_per_minute => 90)

Contract.create(:employable => s1, :company => xyz)

That's already pretty remarkable. Because both programmers and secretaries are employable (as they both expose the employable interface to the contracts model, using has_many :as), we can join each of them to an instance of the contract model.

But, it gets better, if we add a few more associations:


class Contract < ActiveRecord::Base
  belongs_to :company
  belongs_to :employable, :polymorphic => true

  belongs_to :programmer,
    :class_name => 'Programmer', :foreign_key => 'employable_id'
  belongs_to :secretary,
    :class_name => 'Secretary', :foreign_key => 'employable_id'
end

class Company < ActiveRecord::Base
  has_many :contracts

  has_many :programmers, :through => :contracts,
           :source => :programmer,
           :conditions => "contracts.employable_type = 'Programmer' "

  has_many :secretaries, :through => :contracts,
           :source => :secretary,
           :conditions => "contracts.employable_type = 'Secretary' "

end

With this in place, we now have a complete bidirectional association between programmers and secretaries on one side and companies on the other. Thus, we can say:


>> xyz.programmers
=> [#<Programmer id: 1, main_language: "C", first_name: "Linus",
last_name: "Torvalds", email_address: "torvalds@osdl.org", created_at:
"2008-06-12 00:47:58", updated_at: "2008-06-12 00:47:58">]

>> xyz.secretaries
=> [#<Secretary id: 1, words_per_minute: 90, first_name:
"Condoleezza", last_name: "Rice", email_address: "rice@state.gov",
created_at: "2008-06-12 00:54:34", updated_at: "2008-06-12
00:54:34">]

But, we also can say:


>> Programmer.find(1).companies
=> [#<Company id: 1, name: "XYZ Corporation", created_at: "2008-06-12
    00:47:18", updated_at: "2008-06-12 00:47:18">]

Moreover, we can iterate over xyz.contracts, bringing together the secretaries and programmers models into one package:

>> xyz.contracts.each {|c| puts c.employable.first_name}
Linus
Condoleezza

Although Rails does not provide inheritance within the models, polymorphic associations make it possible to come close to such functionality. You also get a bunch of convenience functions that make it more natural to work with these additional attributes.

Conclusion

Not all data fits cleanly into two-dimensional tables. When this occurs, you can try to shoehorn your data into an inappropriate container. Or, you can try to use the help that is built in to one or more levels of your software stack. If you use PostgreSQL, inheritance can be really useful. If you use Rails, you can take advantage of polymorphic associations, allowing you to treat two or more models with a common API as similar. This isn't the sort of thing you'll do each day, but it's a useful skill to have on hand for cases when you need to take unusual data.

Resources

To learn how PostgreSQL allows for inheritance, read the on-line manual at www.postgresql.org/docs/8.3/static/ddl-inherit.html.

Rails Cookbook, by Rob Orsini, and published by O'Reilly, has some good information about polymorhphic associations.

The Rails Wiki has some good examples and descriptions of polymorhphic associations at wiki.rubyonrails.org/rails/pages/UnderstandingPolymorphicAssociations.

Reuven M. Lerner, a longtime Web/database developer and consultant, is a PhD candidate in learning sciences at Northwestern University, studying on-line learning communities. He recently returned (with his wife and three children) to their home in Modi'in, Israel, after four years in the Chicago area.

Load Disqus comments