At the Forge - Advanced MongoDB

 in
A look at some of MongoDB's features, such as indexing and object relationships.

Last month, I started discussing MongoDB, an open-source non-relational “document-based” database that has been growing in popularity during the past year. Unlike relational databases, which store all information in two-dimensional tables, MongoDB stores everything in something akin to a set of hash tables.

In a relational database, you can be sure that every record (that is, row) in a table has the same number and set of columns. By contrast, MongoDB is schema-less, meaning there is no enforcement of such rules on columns. Two records in a MongoDB collection might have identical keys, or they might have no two keys in common. Ensuring that the keys are meaningful, and that they will not be prone to abuse or error, is the programmer's responsibility.

Working with MongoDB turns out to be fairly straightforward, as I showed in several examples last month. Once you have set up a database and a collection, you can add, remove and modify records using a combination of objects in your favorite language and the MongoDB query language.

The fact that it's easy to work with MongoDB doesn't mean that it's lacking in high-powered features, however. This month, I describe some of the features you're likely to use if you incorporate MongoDB into your applications, such as indexing and object relationships. If you're like me, you'll see there is a lot to like; plus, using MongoDB prods you to think about your data in new and different ways.

Indexing

As I explained last month, MongoDB has its own query language, allowing you to retrieve records whose attributes match certain conditions. For example, if you have a book database, you might want to find all books with a certain title. One way to perform such a retrieval would be to iterate over each of the records, pulling out all those that precisely match the title in question. In Ruby, you could express this as:

books.find_all {|b| b.title == search_title}

The problem with this approach is that it's quite slow. The system needs to iterate over each of the items, which means as the list of books grows, so too will the time it takes to find what you're seeking.

The solution to this problem, as database programmers have long known, is to use an index. Indexes come in various forms, but the basic idea is that they allow you to find all records with a particular value for the title immediately (or any column field), without having to scan through each of the individual records. It should come as no surprise, then, that MongoDB supports indexes. How can you use them?

Continuing with this book example, I inserted about 43,000 books into a MongoDB collection. Each inserted document was a Ruby hash, storing the book's ISBN, title, weight and publication date. Then, I could retrieve a book using MongoDB's client program, which provides an interactive JavaScript interface:

   ./bin/mongo atf
> db.books.count()
   38202
> db.books.find({isbn:'9789810185060'})
   { "_id" : ObjectId("4b8fca3ef23f3c614600a8c2"),
     "title" : "Primary Mathematics 4A Textbook",
     "weight" : 40,
     "publication_date" : "2003-01-01",
     "isbn" : "9789810185060" }

The query certainly seems to execute quickly enough, but if there were millions of records, it would slow down quite a bit. You can give the database server a speed boost by adding an index on the isbn column:

> db.books.ensureIndex({isbn:1})

This creates an index on the isbn column in ascending order. You also could specify -1 (instead of 1) to indicate that the items should be indexed in descending order.

Just as a relational database automatically puts an index on the “primary key” column of a table, MongoDB automatically indexes the unique _id attribute on a collection. Every other index needs to be created manually. And indeed, now if you get a list of the indexes, you will see that not only is the isbn column indexed, but so is _id:

> db.books.getIndexes()
   [
       {
               "name" : "_id_",
               "ns" : "atf.books",
               "key" : {
                       "_id" : ObjectId("000000000000000000000000")
               }
       },
       {
               "ns" : "atf.books",
               "key" : {
                       "isbn" : 1
               },
               "name" : "isbn_1"
       }
   ]

Now you can perform the same query as before, requesting all of the books with a particular ISBN. You won't see any change in your result set; however, you should get a response more quickly than before.

You also can create a compound index, which looks at more than one key:

> db.books.ensureIndex({title:1, weight:1})

Perhaps it doesn't make sense to combine the index for a book's title with that of its weight. Nevertheless, that's what I have now done in the example. If you later decide you don't want this index, you can remove it with:

> db.books.dropIndex('title_1_weight_1')
   { "nIndexesWas" : 3, "ok" : 1 }

Because I'm using the JavaScript interface, the response is a JSON object, indicating that there used to be three indexes (and now there are only two), and that the function executed successfully. If you try to drop the index a second time, you'll get an error message:

> db.books.dropIndex('title_1_weight_1')
   { "errmsg" : "index not found", "ok" : 0 }

______________________

White Paper
Linux Management with Red Hat Satellite: Measuring Business Impact and ROI

Linux has become a key foundation for supporting today's rapidly growing IT environments. Linux is being used to deploy business applications and databases, trading on its reputation as a low-cost operating environment. For many IT organizations, Linux is a mainstay for deploying Web servers and has evolved from handling basic file, print, and utility workloads to running mission-critical applications and databases, physically, virtually, and in the cloud. As Linux grows in importance in terms of value to the business, managing Linux environments to high standards of service quality — availability, security, and performance — becomes an essential requirement for business success.

Learn More

Sponsored by Red Hat

White Paper
Private PaaS for the Agile Enterprise

If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.

Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.

Learn More

Sponsored by ActiveState