Use Your Database!

Filtering

Let's say I'm interested in displaying all of the administrators on my system. Above, I showed that you can do that with:


>>> admins = Person.objects.filter(admin=True).all()
>>> for admin in admins:
...     print(admin.username)

However, there's a variation of this that I've often seen people do:


>>> people = Person.objects.all()
>>> for person in people:
...     if person.admin:
...         print(person.username)

Notice what I'm doing here: I'm retrieving all of the objects and then iterating over them. Then, I use an "if" statement in Python to determine whether I want to print the user name. If you're used to working with Python objects, this seems like a perfectly natural thing to do.

However, let's consider what's actually happening here. You're retrieving all of the records and using only a small number of them. This means the database is being forced to read through all of its records, bring all of them into memory and send those records to the Python application—even though the odds are that only a small proportion of these records will be printed.

Moreover, although Python's "if" statement is certainly efficient, there is still some overhead in looking up the person.admin attribute, to say nothing of creating a new Person object for each record that came back from the database. In other words, you're creating a huge number of Person objects just to display some output.

It's far, far more efficient to do your filtering in the database, creating Python objects only for the records you actually intend to display. If your database is defined correctly, it has indexes that it can use to speed up the query when you ask it to filter records, and filtering there consumes less memory, less CPU and less network bandwidth.
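As a concrete illustration, here is a minimal sketch of what the Person model might look like with such an index. The model itself isn't shown in this article, so the field definitions here are assumptions:


from django.db import models

class Person(models.Model):
    username = models.CharField(max_length=100)

    # db_index=True asks Django to create a database index on this
    # column, so filter(admin=True) can use the index rather than
    # scanning every row in the table.
    admin = models.BooleanField(default=False, db_index=True)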

I've seen a variation on this anti-pattern in that people sometimes want to perform transformations on data that they have retrieved from the database. For example, let's assume that I want to apply a sales tax of 10% on all of the prices in a set of records. I certainly could say:


>>> products = Product.objects.all()
>>> for product in products:
...     print(product.price * 1.10)

But it'll be faster, and require less Python code, if I simply say:


>>> products = Product.objects.raw(
...     'select id, price * 1.10 as price_with_tax '
...     'from store_product')
>>> for product in products:
...     print(product.price_with_tax)

Notice how the use of raw allows you to go behind the back of Django's ORM, using whatever SQL you want. Is this the way you always want to do things? Surely not. But in specific cases, or when you want to use an SQL function, it definitely can come in handy. Note that the object you get back from the call to raw() is a RawQuerySet, which is an iterator just like the regular QuerySet. However, it lacks an all() method, which is just as well, given that the RawQuerySet is already an iterator, producing the appropriate records when requested (and not before).

Note that for commonly used SQL functions, such as COUNT, there are built-in Django methods that handle such things. So if you're counting, sorting or grouping, you shouldn't need to step down to the SQL level. And as a general rule, you don't want to do that. However, there are times when it comes in handy—particularly if you're trying to reduce the amount of data you'll have to handle in Python.
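For example, here's a sketch of how both counting and the earlier tax calculation can be pushed into the database without writing raw SQL. F expressions and ExpressionWrapper are part of Django's standard query API; the models are the ones assumed above:


>>> from django.db.models import DecimalField, ExpressionWrapper, F

>>> # COUNT runs in the database; no Person objects are created
>>> Person.objects.filter(admin=True).count()

>>> # The 10% tax calculation as an annotation; the multiplication
>>> # happens in SQL, not in Python
>>> products = Product.objects.annotate(
...     price_with_tax=ExpressionWrapper(
...         F('price') * 1.1,
...         output_field=DecimalField()))
>>> for product in products:
...     print(product.price_with_tax)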

Loops and Joins

The final anti-pattern is something I just saw at a client's office several days before writing this. The company has a large number of products and wants to perform a query for each of the products. So, they did something like this:


>>> products = Product.objects.all()
>>> for product in products:
...     infos = list(ProductInfo.objects.filter(product_id=product.id))

This code took a very long time to run. Why? Because for each of the thousands of products, they were issuing an additional SQL query. The funny thing was that each individual query executed quickly, so it didn't show up in our PostgreSQL slow-query log. But the cumulative effect of executing thousands of such queries was dramatic, and the loop ended up taking many minutes.

The solution was to turn our many queries into a single query. In SQL, we would use an inner join. And indeed, when I used an inner join in raw SQL, we found that instead of taking several minutes to execute, it took 1.5 seconds—obviously, a huge time savings.

There are two possible solutions in Django for this problem. The first is to use a raw SQL query, as I showed above. That's not an ideal solution, particularly since the whole idea of an ORM is to remove the use of SQL and stay within a single language (Python, in this case). But there are times when you cannot avoid it.
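For reference, the raw version of that join looks something like this. The table and column names here are assumptions, based on Django's default naming scheme (appname_modelname for tables, product_id for the foreign-key column):


>>> infos = ProductInfo.objects.raw(
...     'select pi.id, p.price as product_price '
...     'from store_productinfo pi '
...     'inner join store_product p on p.id = pi.product_id')
>>> for info in infos:
...     print(info.product_price)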

However, if you want to be smarter about it, you can use Django's select_related method. This allows you to retrieve not just one model, but a related model as well, in effect creating a join in your database and producing one large query instead of many small ones. The effect on the performance of your application may well be dramatic in such a case, as I discovered when working with my client.
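Here's a sketch of the single-query version using select_related, assuming that ProductInfo has a ForeignKey to Product named product:


>>> # One query, with a join; accessing info.product doesn't go
>>> # back to the database for each row
>>> infos = ProductInfo.objects.select_related('product')
>>> for info in infos:
...     print(info.product.price)

Note that select_related follows single-valued relationships (foreign keys and one-to-one fields). If you're starting from the Product side of a one-to-many relationship, prefetch_related is the appropriate tool instead.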

Conclusion

Object-relational mappers are wonderful things. However, at the end of the day, sometimes they can fool you into forgetting that there is a cost (in time and space) to bringing your data from the database into your language. Most modern frameworks try to help by using lazy loading and iterators, such that you retrieve individual records and not the entire data set. However, it's all too easy to retrieve everything at once, to make your application work too hard or to invoke too many queries against your database.

Resources

The Django documentation is at https://docs.djangoproject.com. Look for the QuerySet documentation to see more about this subject.

If you're using Ruby on Rails, you should look at the documentation for ActiveRecord at http://rubyonrails.com. In particular, see the "lazy-loading" features that are now standard in ActiveRecord.

Finally, Pat Shaughnessy wrote a fantastic blog post on this subject, looking at Ruby on Rails and PostgreSQL. It's worth reading even if you don't use these specific technologies to understand the implications of bringing data out of the database. His piece is at http://patshaughnessy.net/2015/6/18/dont-let-your-data-out-of-the-database.

______________________

Reuven M. Lerner, Linux Journal Senior Columnist, a longtime Web developer, consultant and trainer, is completing his PhD in learning sciences at Northwestern University.