At the Forge - Advanced MongoDB

by Reuven M. Lerner

Last month, I started discussing MongoDB, an open-source non-relational “document-based” database that has been growing in popularity during the past year. Unlike relational databases, which store all information in two-dimensional tables, MongoDB stores everything in something akin to a set of hash tables.

In a relational database, you can be sure that every record (that is, row) in a table has the same number and set of columns. By contrast, MongoDB is schema-less, meaning there is no such enforcement; two records in a MongoDB collection might have identical keys, or they might have no keys in common. Ensuring that the keys are meaningful, and that they are not prone to abuse or error, is the programmer's responsibility.
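For example, in MongoDB's interactive JavaScript client (which you'll see more of below), the following happily stores two documents with no keys in common in the same collection. This assumes a hypothetical, initially empty "people" collection:

> db.people.save({name:'Reuven', email:'reuven@lerner.co.il'})
> db.people.save({nickname:'reuven', shoe_size:43})
> db.people.count()
   2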

Working with MongoDB turns out to be fairly straightforward, as I showed in several examples last month. Once you have set up a database and a collection, you can add, remove and modify records using a combination of objects in your favorite language and the MongoDB query language.

The fact that it's easy to work with MongoDB doesn't mean that it's lacking in high-powered features, however. This month, I describe some of the features you're likely to use if you incorporate MongoDB into your applications, such as indexing and object relationships. If you're like me, you'll see there is a lot to like; plus, using MongoDB prods you to think about your data in new and different ways.

Indexing

As I explained last month, MongoDB has its own query language, allowing you to retrieve records whose attributes match certain conditions. For example, if you have a book database, you might want to find all books with a certain title. One way to perform such a retrieval would be to iterate over each of the records, pulling out all those that precisely match the title in question. In Ruby, you could express this as:

books.find_all {|b| b.title == search_title}

The problem with this approach is that it's quite slow. The system needs to iterate over each of the items, which means as the list of books grows, so too will the time it takes to find what you're seeking.

The solution to this problem, as database programmers have long known, is to use an index. Indexes come in various forms, but the basic idea is that they allow you to find all records with a particular value for a field (the title, say) immediately, without having to scan through each of the individual records. It should come as no surprise, then, that MongoDB supports indexes. How can you use them?

Continuing with this book example, I inserted about 38,000 books into a MongoDB collection. Each inserted document was a Ruby hash, storing the book's ISBN, title, weight and publication date. Then, I could retrieve a book using MongoDB's client program, which provides an interactive JavaScript interface:

   ./bin/mongo atf
> db.books.count()
   38202
> db.books.find({isbn:'9789810185060'})
   { "_id" : ObjectId("4b8fca3ef23f3c614600a8c2"),
     "title" : "Primary Mathematics 4A Textbook",
     "weight" : 40,
     "publication_date" : "2003-01-01",
     "isbn" : "9789810185060" }

The query certainly seems to execute quickly enough, but if there were millions of records, it would slow down quite a bit. You can give the database server a speed boost by adding an index on the isbn field:

> db.books.ensureIndex({isbn:1})

This creates an index on the isbn field in ascending order. You also could specify -1 (instead of 1) to indicate that the items should be indexed in descending order.

Just as a relational database automatically puts an index on the “primary key” column of a table, MongoDB automatically indexes the unique _id attribute on a collection. Every other index needs to be created manually. And indeed, if you now get a list of the indexes, you will see that not only is the isbn field indexed, but so is _id:

> db.books.getIndexes()
   [
       {
               "name" : "_id_",
               "ns" : "atf.books",
               "key" : {
                       "_id" : ObjectId("000000000000000000000000")
               }
       },
       {
               "ns" : "atf.books",
               "key" : {
                       "isbn" : 1
               },
               "name" : "isbn_1"
       }
   ]

Now you can perform the same query as before, requesting all of the books with a particular ISBN. You won't see any change in your result set; however, you should get a response more quickly than before.

You also can create a compound index, which looks at more than one key:

> db.books.ensureIndex({title:1, weight:1})
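One detail worth knowing is that a compound index also can serve queries on a prefix of its keys. With this index in place, a query on title alone can use it, but a query on weight alone cannot, something you can verify with explain():

> db.books.find({title:'Primary Mathematics 4A Textbook'})  // can use title_1_weight_1
> db.books.find({weight:40})                                // cannot; requires a scan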

Perhaps it doesn't make sense to combine the index for a book's title with that of its weight. Nevertheless, that's what I have now done in the example. If you later decide you don't want this index, you can remove it with:

> db.books.dropIndex('title_1_weight_1')
   { "nIndexesWas" : 3, "ok" : 1 }

Because I'm using the JavaScript interface, the response is a JSON object, indicating that there used to be three indexes (and now there are only two), and that the function executed successfully. If you try to drop the index a second time, you'll get an error message:

> db.books.dropIndex('title_1_weight_1')
   { "errmsg" : "index not found", "ok" : 0 }

Enforcing Uniqueness

Indexes not only speed up many queries, but they also allow you to ensure uniqueness. That is, if you want to be sure that a particular attribute is unique across all the documents in a collection, you can define the index with the “unique” parameter.

For example, let's get a record from the current collection:

> db.books.findOne()
   {
      "_id" : ObjectId("4b8fc9baf23f3c6146000b90"),
      "title" : "\"Gateways to Academic Writing: Effective Sentences,
                   Paragraphs, and Essays\"",
      "weight" : 0,
      "publication_date" : "2004-02-01",
      "isbn" : "0131408887"
   }

If you try to insert a new document with the same ISBN, MongoDB won't care:

> db.books.save({isbn:'0131408887', title:'fake book'})

But in theory, there should be only one book with each ISBN. This means the database can (and should) have a uniqueness constraint on ISBN. You can achieve this by dropping and re-creating your index, indicating that the new version of the index also should enforce uniqueness:

> db.books.dropIndex("isbn_1")
   { "nIndexesWas" : 2, "ok" : 1 }
> db.books.ensureIndex({isbn:1}, {unique:true})
   E11000 duplicate key error index: atf.books.$isbn_1  dup key: { : "0131408887" }

Uh-oh. It turns out that there are some duplicate ISBNs in the database already. The good news is that MongoDB shows which key is the offender. Thus, you could go through the database (either manually or automatically, depending on the size of the data set) and remove the offending duplicates, retry creating the index, and so on, until everything works. Or, you can tell the ensureIndex function that it should drop any duplicate records.
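For example, you can list the documents hiding behind the duplicate key from the error message. In this case, they are the "Gateways to Academic Writing" book from before and the 'fake book' inserted a moment ago:

> db.books.find({isbn:'0131408887'}, {title:1})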

Yes, you read that correctly. MongoDB will, if you ask it to, not only create a unique index, but also drop anything that would cause that constraint to be violated. I'm pretty sure I would not want to use this function on actual production data, just because it scares me to think that my database would be removing data. But in this example case, with a toy dataset, it works just fine:

> db.books.ensureIndex({isbn:1}, {unique:true, dropDups:true})

Now, what happens if you try to insert a non-unique ISBN again?

> db.books.save({isbn:'0131408887', title:'fake book'})
   E11000 duplicate key error index: atf.books.$isbn_1  dup key: { : "0131408887" }

You may have as many indexes as you want on a collection. As with a relational database, the main cost of an index is paid when you insert or update data, because each affected index also must be updated. So, if you expect to insert or update your documents a great deal, you should consider carefully how many indexes you want to create.

A second, and more subtle, issue (referenced in David Mytton's blog post—see Resources) is that each MongoDB database has a limit on the number of namespaces it may contain, and that collections and indexes both draw from this same pool.
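In the MongoDB versions current as of this writing, a database's namespaces are themselves stored in a special system.namespaces collection, so a quick count shows how many you have used so far:

> db.system.namespaces.count()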

Combining Objects

One of the touted advantages of an object database—or a “document” database, as MongoDB describes itself—is that you can store just about anything inside it, without the “impedance mismatch” that exists when storing objects in a relational database's two-dimensional tables. So if your object contains a few strings, a few dates and a few integers, you should be just fine.

However, many situations exist in which this is not quite enough. One classic example (discussed in many MongoDB FAQs and interviews) is that of a blog. It makes sense to have a collection of blog posts, and for each post to have a date, a title and a body. But, you'll also need an author, and assuming that you want to store more than just the author's name, or another simple text string, you probably will want to have each author stored as an object.

So, how can you do that? The simplest way is to store an object along with each blog post. If you have used a high-level language, such as Ruby or Python, this won't come as a surprise; you're just sticking a hash inside a hash (or, if you're a Python hacker, a dict inside a dict). So, in the JavaScript client, you can say:

> db.blogposts.save({title:'title',
                     body:'this is the body',
                     author:{name:'Reuven',
                             email:'reuven@lerner.co.il'} })

Remember, MongoDB creates a collection for you if it doesn't exist already. Then, you can retrieve your post with:

> db.blogposts.findOne()
   {
           "_id" : ObjectId("4b91070a9640ce564dbe5a35"),
           "title" : "title",
           "body" : "this is the body",
           "author" : {
                   "name" : "Reuven",
                   "email" : "reuven@lerner.co.il"
           }
   }

Or, you can retrieve the e-mail address of that author with:

> db.blogposts.findOne()['author']['email']
   reuven@lerner.co.il

Or, you can even search (here, for a title that doesn't exist):

> db.blogposts.findOne({title:'titleee'})
   null

In other words, no postings matched the search criteria.

Now, if you have worked with relational databases for any length of time, you probably are thinking, “Wait a second. Is he saying I should store an identical author object with each posting that the author made?” And the answer is yes—something that I admit gives me the heebie-jeebies. MongoDB, like many other document databases, does not require or even expect that you will normalize your data—the opposite of what you would do with a relational database.

The advantage of a non-normalized approach is that it's easy to work with and fast to read, because everything about a posting is retrieved in a single query, with no joins. The disadvantage, as everyone who ever has studied normalization knows, is that if you need to update the author's e-mail address, you need to iterate over all the entries in your collection—an expensive task in many cases. In addition, there's always the chance that different blog postings will spell the same author's name in different ways, leading to problems with data integrity.
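To make that cost concrete, here is what such an update looks like in the shell, assuming the embedded-author layout shown earlier and a hypothetical new address. The final true tells update() to modify every matching document, not just the first:

> db.blogposts.update({'author.email':'reuven@lerner.co.il'},
                      {$set: {'author.email':'reuven@example.com'}},
                      false, true)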

If there is one issue that gives me pause when working with MongoDB, it is this one—the fact that the data isn't normalized goes against everything that I've done over the years. I'm not sure whether my reaction indicates that I need to relax about this issue, choose MongoDB only for particularly appropriate tasks, or if I'm a dinosaur.

MongoDB does offer a partial solution. Instead of embedding an object within another object, you can store a reference to another object, either in the same collection or in another collection. For example, you can create a new “authors” collection in your database, and then create a new author:

> db.authors.save({name:'Reuven', email:'reuven@lerner.co.il'})

> a = db.authors.findOne()
   {
           "_id" : ObjectId("4b910a469640ce564dbe5a36"),
           "name" : "Reuven",
           "email" : "reuven@lerner.co.il"
   }

Now you can assign this author to your blog post, replacing the object literal from before:

> p = db.blogposts.findOne()
> p['author'] = a

> p
   {
           "_id" : ObjectId("4b91070a9640ce564dbe5a35"),
           "title" : "title",
           "body" : "this is the body",
           "author" : {
                   "_id" : ObjectId("4b910a469640ce564dbe5a36"),
                   "name" : "Reuven",
                   "email" : "reuven@lerner.co.il"
           }
   }

Although the blog post looks similar to what you had before, notice that the author now has its own “_id” attribute. That _id is what makes this a reference to a document in the authors collection, rather than free-floating embedded data. And because p['author'] and a are now the same object in the shell, changes to that object are immediately reflected, as you can see here:

> a['name'] = 'Reuven Lerner'
   Reuven Lerner
> p
   {
           "_id" : ObjectId("4b91070a9640ce564dbe5a35"),
           "title" : "title",
           "body" : "this is the body",
           "author" : {
                   "_id" : ObjectId("4b910a469640ce564dbe5a36"),
                   "name" : "Reuven Lerner",
                   "email" : "reuven@lerner.co.il"
           }
   }

See how the author's “name” attribute was updated immediately? That's because you have an object reference here (p['author'] and a point at the same object), rather than an independent embedded copy.
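Keep in mind that these changes so far live only in the shell's memory. To persist them, save both documents back to their respective collections:

> db.authors.save(a)
> db.blogposts.save(p)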

Given the ease with which you can reference objects from other objects, why not do this all the time? To be honest, this is definitely my preference, perhaps reflecting my years of work with relational databases. MongoDB's authors, by contrast, indicate that the main problem with this approach is that it requires additional reads from the database, which slows down the data-retrieval process. You will have to decide what trade-offs are appropriate for your needs, both now and in the future.
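That trade-off is easy to see in the shell. Because MongoDB performs no joins, getting the up-to-date author for a posting means issuing a second query, using the _id stored with the post:

> p = db.blogposts.findOne()
> author = db.authors.findOne({_id: p['author']['_id']})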

Conclusion

MongoDB is an impressive database, with extensive documentation and drivers. It is easy to begin working with MongoDB, and the interactive shell is straightforward for anyone with even a bit of JavaScript and database experience. Indexes are fairly easy to understand, create and apply.

Where things get tricky, and even sticky, is precisely in the area where relational databases have excelled (and have been optimized) for decades—namely, the interactions and associations among related objects, ensuring data integrity without compromising speed too much. I'm sure MongoDB will continue to improve in this area, but for now, this is the main thing that bothers me about MongoDB. Nevertheless, I've been impressed by what I've seen so far, and I easily can imagine using it on some of my future projects, especially those that will have a limited number of cross-collection references.

Resources

The main site for MongoDB, including source code and documentation, is at mongodb.org. A reference guide to the interactive, JavaScript-based shell is at www.mongodb.org/display/DOCS/dbshell+Reference.

For an excellent introduction to MongoDB, including some corporate background on 10gen and how it can be used in your applications, listen to episode 105 of the “FLOSS Weekly” podcast. I found the podcast to be both entertaining and informative.

Another good introduction is from John Nunemaker, a well-known blogger in the Ruby world: railstips.org/blog/archives/2009/06/03/what-if-a-key-value-store-mated-with-a-relational-database-system.

Mathias Meyer wrote a terrific introduction and description of MongoDB on his blog: www.paperplanes.de/2010/2/25/notes_on_mongodb.html.

Because MongoDB is a “document” database, you might be wondering if there is any way to generate a full-text index on a document. The answer is “kind of”, with more information and hints available at www.mongodb.org/display/DOCS/Full+Text+Search+in+Mongo.

Finally, David Mytton recently wrote a blog post, in which he described some of the issues he encountered when using MongoDB in a production environment: blog.boxedice.com/2010/02/28/notes-from-a-production-mongodb-deployment.

Reuven M. Lerner is a longtime Web developer, trainer, and consultant. He is a PhD candidate in Learning Sciences at Northwestern University. Reuven lives with his wife and three children in Modi'in, Israel.
