At the Forge - Advanced MongoDB

A look at some of MongoDB's features, such as indexing and object relationships.
Enforcing Uniqueness

Indexes not only speed up many queries, but they also allow you to ensure uniqueness. That is, if you want to be sure that a particular attribute is unique across all the documents in a collection, you can define the index with the “unique” parameter.

For example, let's get a record from the current collection:

> db.books.findOne()
   {
      "_id" : ObjectId("4b8fc9baf23f3c6146000b90"),
      "title" : "\"Gateways to Academic Writing: Effective Sentences,
                   Paragraphs, and Essays\"",
      "weight" : 0,
      "publication_date" : "2004-02-01",
      "isbn" : "0131408887"
   }

If you try to insert a new document with the same ISBN, MongoDB won't care:

> db.books.save({isbn:'0131408887', title:'fake book'})

But in theory, there should be only one book with each ISBN. This means the database can (and should) have a uniqueness constraint on ISBN. You can achieve this by dropping and re-creating your index, indicating that the new version of the index also should enforce uniqueness:

> db.books.dropIndex("isbn_1")
   { "nIndexesWas" : 2, "ok" : 1 }
> db.books.ensureIndex({isbn:1}, {unique:true})
   E11000 duplicate key error index: atf.books.$isbn_1
   ↪dup key: { : "0131408887" }

Uh-oh. It turns out there are already some duplicate ISBNs in the database. The good news is that MongoDB shows which key is the offender. Thus, you could go through the database (either manually or automatically, depending on the size of the data set), remove the offending duplicates, retry creating the index, and so on, until everything works. Or, you can tell the ensureIndex function that it should drop any duplicate records.
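For instance, you could do the cleanup by hand in the shell. Here is a minimal sketch, which assumes the only duplicate is the “fake book” document inserted earlier:

> db.books.find({isbn:'0131408887'})
> db.books.remove({isbn:'0131408887', title:'fake book'})

The first command lists every document sharing the offending ISBN; the second removes only the throwaway duplicate, leaving the original record in place.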

Yes, you read that correctly. MongoDB will, if you ask it to, not only create a unique index, but also drop anything that would cause that constraint to be violated. I'm pretty sure I would not want to use this function on actual production data, just because it scares me to think that my database would be removing data. But in this example case, with a toy dataset, it works just fine:

> db.books.ensureIndex({isbn:1}, {unique:true, dropDups:true})

Now, what happens if you try to insert a non-unique ISBN again?

> db.books.save({isbn:'0131408887', title:'fake book'})
   E11000 duplicate key error index: atf.books.$isbn_1
   ↪dup key: { : "0131408887" }

You may have as many indexes as you want on a collection. As with a relational database, the main cost of an index is paid when you insert or update data, because every affected index must be updated along with the document. So, if you expect to insert or update your documents a great deal, you should consider carefully how many indexes you want to create.
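If you want to review the indexes defined on a collection, the shell's getIndexes() function will list them:

> db.books.getIndexes()

Each element of the resulting array describes one index, including the key it covers and whether it enforces uniqueness.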

A second, and more subtle, issue (referenced in David Mytton's blog post; see Resources) is that each MongoDB database has a limited number of namespaces, and those namespaces are consumed by both collections and indexes.
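You can see how many namespaces are already in use by querying the special system.namespaces collection, which (in the MongoDB versions current as of this writing) contains one entry per collection and one per index:

> db.system.namespaces.find()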

Combining Objects

One of the touted advantages of an object database—or a “document” database, as MongoDB describes itself—is that you can store just about anything inside it, without the “impedance mismatch” that exists when storing objects in a relational database's two-dimensional tables. So if your object contains a few strings, a few dates and a few integers, you should be just fine.

However, many situations exist in which this is not quite enough. One classic example (discussed in many MongoDB FAQs and interviews) is that of a blog. It makes sense to have a collection of blog posts, and for each post to have a date, a title and a body. But, you'll also need an author, and assuming that you want to store more than just the author's name, or another simple text string, you probably will want to have each author stored as an object.

So, how can you do that? The simplest way is to store an object along with each blog post. If you have used a high-level language, such as Ruby or Python, this won't come as a surprise; you're just sticking a hash inside a hash (or, if you're a Python hacker, a dict inside a dict). So, in the JavaScript client, you can say:

> db.blogposts.save({title:'title',
                        body:'this is the body',
                        author:{name:'Reuven', 
                        ↪email:'reuven@lerner.co.il'} })

Remember, MongoDB creates a collection for you if it doesn't exist already. Then, you can retrieve your post with:

> db.blogposts.findOne()
   {
           "_id" : ObjectId("4b91070a9640ce564dbe5a35"),
           "title" : "title",
           "body" : "this is the body",
           "author" : {
                   "name" : "Reuven",
                   "email" : "reuven@lerner.co.il"
           }
   }

Or, you can retrieve the e-mail address of that author with:

> db.blogposts.findOne()['author']['email']
   reuven@lerner.co.il

Or, you even can search:

> db.blogposts.findOne({title:'titleee'})
   null

In other words, no postings matched the search criteria.
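Searches also can reach inside embedded objects, using dot notation to describe a path into the document. For example, to find the post by its author's name:

> db.blogposts.findOne({'author.name':'Reuven'})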

Now, if you have worked with relational databases for any length of time, you probably are thinking, “Wait a second. Is he saying I should store an identical author object with each posting that the author made?” And the answer is yes—something that I admit gives me the heebie-jeebies. MongoDB, like many other document databases, does not require or even expect that you will normalize your data—the opposite of what you would do with a relational database.

The advantages of a non-normalized approach are that it's easy to work with and that reads are fast, because each post comes back in a single query. The disadvantage, as everyone who ever has studied normalization knows, is that if you need to update the author's e-mail address, you must iterate over all the entries in your collection, an expensive task in many cases. In addition, there's always the chance that different blog postings will spell the same author's name in different ways, leading to problems with data integrity.
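That said, you don't have to write such a sweep by hand. The shell's update() function takes upsert and multi flags as its third and fourth arguments, so a single command can modify every matching post. A sketch, using a made-up new address:

> db.blogposts.update({'author.name':'Reuven'},
                      {$set: {'author.email':'reuven@example.com'}},
                      false, true)

With multi set to true, every post whose embedded author matches is updated in one pass; it's still a sweep over the matching documents, but one the server performs for you.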

If there is one issue that gives me pause when working with MongoDB, it is this one; the fact that the data isn't normalized goes against everything I've done over the years. I'm not sure whether my reaction indicates that I need to relax about this issue, that I should choose MongoDB only for particularly appropriate tasks, or that I'm simply a dinosaur.

MongoDB does offer a partial solution. Instead of embedding an object within another object, you can store a reference to another object, either in the same collection or in another collection. For example, you can create a new “authors” collection in your database, and then create a new author:

> db.authors.save({name:'Reuven', email:'reuven@lerner.co.il'})

> a = db.authors.findOne()
   {
           "_id" : ObjectId("4b910a469640ce564dbe5a36"),
           "name" : "Reuven",
           "email" : "reuven@lerner.co.il"
   }

Now you can assign this author to your blog post, replacing the object literal from before:

> p = db.blogposts.findOne()
> p['author'] = a

> p
   {
           "_id" : ObjectId("4b91070a9640ce564dbe5a35"),
           "title" : "title",
           "body" : "this is the body",
           "author" : {
                   "_id" : ObjectId("4b910a469640ce564dbe5a36"),
                   "name" : "Reuven",
                   "email" : "reuven@lerner.co.il"
           }
   }

Although the blog post looks similar to what you had before, notice that the author now carries its own “_id” attribute, showing that it is a document from the authors collection rather than a plain embedded literal. In the shell, p['author'] and a now refer to the same object, so changes to the author are immediately reflected in the post, as you can see here:

> a['name'] = 'Reuven Lerner'
   Reuven Lerner
> p
   {
           "_id" : ObjectId("4b91070a9640ce564dbe5a35"),
           "title" : "title",
           "body" : "this is the body",
           "author" : {
                   "_id" : ObjectId("4b910a469640ce564dbe5a36"),
                   "name" : "Reuven Lerner",
                   "email" : "reuven@lerner.co.il"
           }
   }

See how the author's “name” attribute was updated immediately? That's because p['author'] and a refer to the same object, rather than to two separate embedded copies.
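Keep in mind that these changes so far live only in the shell's memory. To persist them, save each modified object back to its collection:

> db.authors.save(a)
> db.blogposts.save(p)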

Given the ease with which you can reference objects from other objects, why not do this all the time? To be honest, this is definitely my preference, perhaps reflecting my years of work with relational databases. MongoDB's authors, by contrast, indicate that the main problem with this approach is that it requires additional reads from the database, which slows down the data-retrieval process. You will have to decide what trade-offs are appropriate for your needs, both now and in the future.
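That extra read is easy to see in the shell. To go from a post back to the full, current author record, you issue a second query against the authors collection, keyed on the stored “_id”:

> p = db.blogposts.findOne()
> db.authors.findOne({_id: p['author']['_id']})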
