At the Forge - Advanced MongoDB

A look at some of MongoDB's features, such as indexing and object relationships.
Enforcing Uniqueness

Indexes not only speed up many queries, but they also allow you to ensure uniqueness. That is, if you want to be sure that a particular attribute is unique across all the documents in a collection, you can define the index with the “unique” parameter.
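
The examples that follow assume the books collection already has an ordinary, non-unique index on isbn (which MongoDB names isbn_1); if you are starting from scratch, such an index can be created with a single call:

> db.books.ensureIndex({isbn:1})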

For example, let's get a record from the current collection:

> db.books.findOne()
   {
      "_id" : ObjectId("4b8fc9baf23f3c6146000b90"),
      "title" : "\"Gateways to Academic Writing: Effective Sentences,
                   Paragraphs, and Essays\"",
      "weight" : 0,
      "publication_date" : "2004-02-01",
      "isbn" : "0131408887"
   }

If you try to insert a new document with the same ISBN, MongoDB won't care:

> db.books.save({isbn:'0131408887', title:'fake book'})

But in theory, there should be only one book with each ISBN. This means the database can (and should) have a uniqueness constraint on ISBN. You can achieve this by dropping and re-creating your index, indicating that the new version of the index also should enforce uniqueness:

> db.books.dropIndex("isbn_1")
   { "nIndexesWas" : 2, "ok" : 1 }
> db.books.ensureIndex({isbn:1}, {unique:true})
   E11000 duplicate key error index: atf.books.$isbn_1  dup key: { : "0131408887" }

Uh-oh. It turns out that there are already some duplicate ISBNs in the database. The good news is that MongoDB shows which key is the offender. Thus, you could go through the database (either manually or automatically, depending on the size of the data set), remove the duplicate documents containing that key, retry creating the index, and so on, until everything works. Or, you can tell the ensureIndex function to drop any duplicate records for you.
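
For a small data set, the manual route might look something like the following sketch, which keeps the first document carrying the offending ISBN (the one reported in the error message) and removes the rest:

> // Keep the first document with this ISBN; remove the other duplicates.
> var dups = db.books.find({isbn:"0131408887"}).toArray()
> for (var i = 1; i < dups.length; i++) {
...     db.books.remove({_id: dups[i]._id})
... }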

Yes, you read that correctly. MongoDB will, if you ask it to, not only create a unique index, but also drop anything that would cause that constraint to be violated. I'm pretty sure I would not want to use this function on actual production data, just because it scares me to think that my database would be removing data. But in this example case, with a toy dataset, it works just fine:

> db.books.ensureIndex({isbn:1}, {unique:true, dropDups:true})

Now, what happens if you try to insert a non-unique ISBN again?

> db.books.save({isbn:'0131408887', title:'fake book'})
   E11000 duplicate key error index: atf.books.$isbn_1  dup key: { : "0131408887" }

You may have as many indexes as you want on a collection. As with a relational database, the main cost of an index is incurred when you insert or update data, so if you expect to insert or update your documents frequently, you should consider carefully how many indexes you create.
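
You can check which indexes a collection currently has with getIndexes(); on the toy collection used here, the output would look roughly like this (field order and extra attributes vary by server version):

> db.books.getIndexes()
   [
           { "name" : "_id_", "ns" : "atf.books", "key" : { "_id" : 1 } },
           { "name" : "isbn_1", "ns" : "atf.books",
             "key" : { "isbn" : 1 }, "unique" : true }
   ]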

A second, more subtle issue (referenced in David Mytton's blog post; see Resources) is that each MongoDB database has a limited number of namespaces, and that namespaces are consumed by both collections and indexes.
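
On the server versions available as I write this, each collection and each index occupies one entry in the database's namespace (.ns) file, which by default holds roughly 24,000 entries; you can inspect those entries directly to see how close you are to the limit:

> // Each collection and each index is one namespace entry.
> db.system.namespaces.count()
> db.system.namespaces.find()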

Combining Objects

One of the touted advantages of an object database—or a “document” database, as MongoDB describes itself—is that you can store just about anything inside it, without the “impedance mismatch” that exists when storing objects in a relational database's two-dimensional tables. So if your object contains a few strings, a few dates and a few integers, you should be just fine.

However, many situations exist in which this is not quite enough. One classic example (discussed in many MongoDB FAQs and interviews) is that of a blog. It makes sense to have a collection of blog posts, and for each post to have a date, a title and a body. But, you'll also need an author, and assuming that you want to store more than just the author's name, or another simple text string, you probably will want to have each author stored as an object.

So, how can you do that? The simplest way is to store an object along with each blog post. If you have used a high-level language such as Ruby or Python before, this won't come as a surprise; you're just sticking a hash inside a hash (or, if you're a Python hacker, a dict inside a dict). So, in the JavaScript client, you can say:

> db.blogposts.save({title:'title',
                        body:'this is the body',
                        author:{name:'Reuven',
                                email:'reuven@lerner.co.il'} })

Remember, MongoDB creates a collection for you if it doesn't exist already. Then, you can retrieve your post with:

> db.blogposts.findOne()
   {
           "_id" : ObjectId("4b91070a9640ce564dbe5a35"),
           "title" : "title",
           "body" : "this is the body",
           "author" : {
                   "name" : "Reuven",
                   "email" : "reuven@lerner.co.il"
           }
   }

Or, you can retrieve the e-mail address of that author with:

> db.blogposts.findOne()['author']['email']
   reuven@lerner.co.il

Or, you even can search:

> db.blogposts.findOne({title:'titleee'})
   null

In other words, no postings matched the search criteria.

Now, if you have worked with relational databases for any length of time, you probably are thinking, “Wait a second. Is he saying I should store an identical author object with each posting that the author made?” And the answer is yes—something that I admit gives me the heebie-jeebies. MongoDB, like many other document databases, does not require or even expect that you will normalize your data—the opposite of what you would do with a relational database.

The advantages of a non-normalized approach are that it's generally easier to work with, and that reads are faster, because a single query retrieves the posting and its author together. The disadvantage, as everyone who has ever studied normalization knows, is that if you need to update the author's e-mail address, you need to iterate over all the entries in your collection—an expensive task in many cases. In addition, there's always the chance that different blog postings will spell the same author's name in different ways, leading to problems with data integrity.
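
To make that cost concrete, changing an embedded author's e-mail address means a multi-document update across every posting by that author. Here is a sketch, using the shell's update(query, modification, upsert, multi) signature and a made-up replacement address:

> db.blogposts.update(
      {'author.email': 'reuven@lerner.co.il'},
      {$set: {'author.email': 'new-address@example.com'}},
      false,  // upsert
      true)   // multi: update every matching posting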

If there is one issue that gives me pause when working with MongoDB, it is this one—the fact that the data isn't normalized goes against everything that I've done over the years. I'm not sure whether my reaction indicates that I need to relax about this issue, choose MongoDB only for particularly appropriate tasks, or if I'm a dinosaur.

MongoDB does offer a partial solution. Instead of embedding an object within another object, you can enter a reference to another object, either in the same collection or in another collection. For example, you can create a new “authors” collection in your database, and then create a new author:

> db.authors.save({name:'Reuven', email:'reuven@lerner.co.il'})

> a = db.authors.findOne()
   {
           "_id" : ObjectId("4b910a469640ce564dbe5a36"),
           "name" : "Reuven",
           "email" : "reuven@lerner.co.il"
   }

Now you can assign this author to your blog post, replacing the object literal from before:

> p = db.blogposts.findOne()
> p['author'] = a

> p
   {
           "_id" : ObjectId("4b91070a9640ce564dbe5a35"),
           "title" : "title",
           "body" : "this is the body",
           "author" : {
                   "_id" : ObjectId("4b910a469640ce564dbe5a36"),
                   "name" : "Reuven",
                   "email" : "reuven@lerner.co.il"
           }
   }

Although the blog post looks similar to what you had before, notice that its author now has its own “_id” attribute, because it came from the authors collection. This shows that you are referencing another object in MongoDB. Changes to that object are immediately reflected, as you can see here:

> a['name'] = 'Reuven Lerner'
   Reuven Lerner
> p
   {
           "_id" : ObjectId("4b91070a9640ce564dbe5a35"),
           "title" : "title",
           "body" : "this is the body",
           "author" : {
                   "_id" : ObjectId("4b910a469640ce564dbe5a36"),
                   "name" : "Reuven Lerner",
                   "email" : "reuven@lerner.co.il"
           }
   }

See how the author's “name” attribute was updated immediately? That's because you have an object reference here, rather than an embedded object.

Given the ease with which you can reference objects from other objects, why not do this all the time? To be honest, this is definitely my preference, perhaps reflecting my years of work with relational databases. MongoDB's authors, by contrast, indicate that the main problem with this approach is that it requires additional reads from the database, which slows down the data-retrieval process. You will have to decide what trade-offs are appropriate for your needs, both now and in the future.
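
To see where those extra reads come from, imagine that each posting stores only the author's _id (in a hypothetical author_id field) rather than an embedded copy; displaying a single posting then takes two queries instead of one:

> var post = db.blogposts.findOne()
> var author = db.authors.findOne({_id: post.author_id})  // second round-trip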
