Use Your Database!

I love high-level, dynamically typed languages, such as Python, Ruby and JavaScript. They're easy—and even fun—to use. They let me express myself richly, and they lend themselves to code that easily can be reused and maintained. It's no surprise that interest in such languages is on the rise, especially when creating Web applications.

Now, one of the downsides of these languages is that they tend to execute more slowly than static languages, such as Java, C# and Go. But for a very large number of Web applications, this speed difference doesn't matter, or it is justified by the productivity gain enjoyed by the engineers, or it can be (somewhat) handled by throwing hardware at the problem.

However, the fact that dynamic languages run more slowly than their static counterparts doesn't mean you want to ignore the speed issue completely. Once you know your way around dynamic languages, as well the frameworks built in them, you get a sense of what runs quickly and what doesn't.

In just the past few weeks, however, I've encountered a pattern—or perhaps I should say, an "anti-pattern"—in the code that several of my consulting clients had written. This anti-pattern is well known to experienced developers, but it seems to be less well known than I would have hoped or expected. That anti-pattern, stated simply, is that you should have the database do as much work as possible.

There are several reasons for throwing as much as possible at your database server. First and foremost, your database almost certainly is written in C, so it's likely to execute more quickly than your high-level, dynamic code.

Second, your database has been highly optimized through the years, such that retrieving data from it has been tuned to take into account memory, disk and the frequency of the retrievals.

Third, although network bandwidth is cheap nowadays, it's not infinitely fast. This means that although you could, in theory, write a database query in Ruby that returns a large number of rows and then filters through them using Enumerate#map, if you have the database do some of this for you, it can reduce the amount of data you're retrieving dramatically and, thus, lead to faster application responses and less network usage.

So in this article, I explore this anti-pattern of doing work in an application that probably should be done in the database. You'll see how you can get the same results, but much faster, by applying this rule. There's obviously no one right way to do things, but having the database do as much work as possible is likely to make your applications faster and easier to maintain.

Don't Load Everything

High-level languages, and most high-level Web frameworks, don't encourage you to write SQL directly. Rather, you use objects and methods to work with the database; the methods you invoke are translated into SQL by an ORM (object-relational mapper). Part of the reason, I believe, for the widespread inefficiencies in people's database queries is that they don't see the SQL they're writing, so they don't understand some of the implications of their method calls.

For example, say I'm working on a project in Django. If I have a model named Person, I can (and should) invoke the "objects" method in order to work with the corresponding table in the database. I then can take the resulting object and apply additional filters, getting (for example) the records corresponding to people who are system administrators:

>>> admins = Person.objects.filter(admin=True).all()

Once I've done that, "admins" will contain a set of records, known in the Django world as a "QuerySet". But in actuality, a QuerySet doesn't contain the records themselves. Rather, it serves as a go between to the database. If you iterate over the QuerySet, you'll get each of the records, one by one.

Thus, even if you'll eventually get one million records back from the database, the above code doesn't retrieve them. You can get the records, one by one, by iterating over the result set. For example, the following will display the user names for all of the administrators:

>>> for admin in admins:

This is the right way to work with objects in Django. Although it might seem weird not to have the entire result set in memory, the implications are tremendous. You don't need to worry about using up all of the server's memory if the resulting records will be too large.

Working with iterators is easy and straightforward, if you're used to it. If you're not, it might seem strange not to have the entire result set at once, and to iterate over it. Moreover, all you need is the right combination of a result set and the following code:

>>> admins = list(Person.objects.filter(admin=True).all())

Notice what I've changed in the above assignment? I'm no longer asking for the QuerySet, over which I can iterate. Rather, I've asked for the QuerySet's data to be used to create a list and then assigned to "admins". If you have one million records in your result set, this is going to consume a fairly large amount of memory.

It's true that this can be necessary, at times, but those times are fairly rare. After all, the odds are pretty good that you're retrieving the records in order to display them to users, something that is easily and efficiently accomplished with iteration.


Reuven M. Lerner, Linux Journal Senior Columnist, a longtime Web developer, consultant and trainer, is completing his PhD in learning sciences at Northwestern University.