At the Forge - MongoDB

A look at one of the best-known contenders in the non-relational database space.

Lately I've been teaching programming courses in both Python and Ruby, often to seasoned developers used to C++ and Java. Inevitably, the fact that Python and Ruby are dynamically typed languages, allowing any variable to contain any type of value, catches these students by surprise. They often are shocked to find that a given variable can, at any point in the program, be assigned to contain an integer, a string or an instance of an object, without any constraints. They wonder how it is that anyone could (or would) use such a language, given the possibility for runtime type errors. One of my jobs, as the instructor of this course, is to convince them that it is possible to work in such a language, but that doing so might require more adherence to conventions than they are used to.

So, it's ironic that during the last few months, as I have begun to experiment with non-relational databases, I have found myself experiencing something akin to my students' shock. My long-standing beliefs about data integrity and what constitutes a reliable database have gone through a bit of a shake-up. I'm still a bit wary of these non-relational (or NoSQL) databases, and I'm far from convinced that the time has come to throw out SQL and the relational model in favor of something that is often easier to work with.

I do think, as I outlined in last month's column, that these databases offer a type of storage and retrieval that often is a more natural fit for many data-storage requirements. And, just as memcached offered an alternative storage system that complemented relational databases rather than replacing them, so too can these non-relational databases perform many useful functions that would be difficult with a relational database.

One of the best-known contenders in the non-relational database space is MongoDB. MongoDB is an open-source project, sponsored by New York-based 10gen (which intends to make money from licensing and support fees). It is written in C++, and there are drivers for all popular modern languages. The software is licensed under the GNU Affero General Public License, which means if you modify the MongoDB source, and if those modifications are available on a publicly accessible Web site, you must distribute the source to your modifications. This is different from the standard GPL, which does not require that you divulge the source code to server-side applications with which people interact via a browser or other Internet client.

MongoDB has gained a large number of adherents because of its combination of features. It is easy to work with from a variety of languages, is extremely fast (written in C++), is actively supported by both a company and a large community and has proven itself to be stable in many situations and under high-stress conditions. It also includes a number of features for indexing and scaling that make it attractive.

MongoDB, like several of its competitors, describes itself as a document database. This does not mean it is a filesystem meant to store documents, but rather that it replaces the model of tables, rows and columns with that of “documents” consisting of one or more name-value pairs. I find it easier to think of documents as hash tables (or Python dictionaries), in which the keys are strings and the values can be just about anything. Each of these documents exists in a collection, and you can have one or more collections.
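To make the dictionary analogy concrete, here is a minimal sketch that models a collection with nothing but Python's built-in types. It does not use MongoDB or its driver at all, and the field names are made up for illustration; it only shows what "documents made of name-value pairs, grouped in a collection" looks like as data.

```python
# Toy model only: plain Python data structures, not the MongoDB driver.

# A "collection" is modeled as a list; each "document" is a dict
# whose keys are strings.
stuff = []

# Values can be numbers, strings, lists or nested dicts.
stuff.append({"a": 1, "b": 2})
stuff.append({"name": "example", "tags": ["x", "y"], "size": {"w": 3, "h": 4}})

# No schema: documents in one collection need not share any keys.
for document in stuff:
    print(sorted(document))

# Collect every field name used anywhere in the collection.
all_keys = set()
for document in stuff:
    all_keys.update(document)
```

In a relational table, the second document would require either new columns or a separate table; here it simply sits next to the first one.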

In many ways, you can think of MongoDB as an object database, because it allows you to store and retrieve items as objects, rather than force them into two-dimensional tables. However, this object database stores only basic object types—numbers, strings, lists and hashes, for example. Fortunately, these types can store a wide variety of data, flexibly and reliably, so this is not much of a concern.

Downloading and Installing

To download MongoDB, go to mongodb.org, and retrieve the version appropriate for your system. For my server running Ubuntu 8.10, I retrieved the 32-bit version of MongoDB 1.2.2. There is an option to retrieve a statically linked version, but the site itself indicates that this is a fallback, in case the dynamically linked version fails.

After unpacking the MongoDB server, create a directory in which it can store its data. By default, this is /data/db, which you can create with:

mkdir -p /data/db

Start the MongoDB server process with:

./bin/mongod

Now that you have a server running, you might expect that you need to create a database. However, this step is unnecessary. If you try to connect to a database that has not yet been defined, MongoDB creates it for you. I tend to do most of my MongoDB work in Ruby, so I downloaded and installed the driver for Ruby from GitHub and started up the interactive Ruby interpreter, irb. Then, I typed:

irb(main):001:0> require 'rubygems'
irb(main):002:0> require 'mongo'

With the MongoDB driver loaded, I was able to connect to the already-running server, creating an “atf” database:

irb(main):005:0> db = Mongo::Connection.new.db("atf")

After this, db is an instance of the Mongo::DB class, representing a MongoDB database. Each database may contain any number of collections, analogous to tables in a relational database. By default, this example database contains no collections, as you can see with this small snippet of code:

irb(main):008:0> db.collection_names.each { |name| puts name }
=> [ ]

The return value of an empty list shows that the database is currently empty.

You can create a new collection by invoking the collection method on your database connection:

irb(main):012:0> c = db.collection("stuff")

Once you have created your collection, you also can see that MongoDB has silently created a second collection, named system.indexes, used for indexing the contents:

irb(main):032:0> db.collection_names
=> ["stuff", "system.indexes"]

Because MongoDB is a schema-less database, you can begin to store items to your collection immediately, without defining its columns or data types. In practice, this means you can store hashes with any keys and values that you choose. For example, you can use the insert method to add a new item to your collection:

irb(main):017:0> c.insert({:a => 1, :b => 2})
=> 4b6fe8983c1c7d6a6a000001

The return value is the unique ID for this document (or object) that has just been stored. You can ask the collection to show what you have stored by invoking its find_one method:

irb(main):021:0> c.find_one
=> {"_id"=>4b6fe8983c1c7d6a6a000001, "a"=>1, "b"=>2}

Notice that two things have happened here. First, the keys have been turned from Ruby symbols into strings. Indeed, MongoDB requires that all keys be strings; because symbols are used so pervasively in the Ruby world for hash keys, they are translated into strings silently if you use them.

Second, you can see that another key, named _id, has been added to the document, and its value matches the return value that you received with your first insert.
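The two effects just described can be imitated in a few lines of plain Python. This is a rough sketch of the observable behavior, not the driver's actual code: toy_insert is a hypothetical helper, str() stands in for the symbol-to-string conversion, and the ID is simply 12 random bytes printed as 24 hex characters, which copies the shape of a real ID but not how it is generated.

```python
import secrets

def toy_insert(collection, document):
    """Store a copy of document with string keys and an "_id" field; return the ID."""
    # Every key becomes a string, much as Ruby symbols do in the real driver.
    stored = {str(key): value for key, value in document.items()}
    # Placeholder ID (an assumption for illustration): 12 random bytes as hex.
    stored.setdefault("_id", secrets.token_hex(12))
    collection.append(stored)
    return stored["_id"]

stuff = []
new_id = toy_insert(stuff, {"a": 1, "b": 2})

# The returned value is the same one stored under "_id".
print(new_id == stuff[0]["_id"])
print(sorted(stuff[0]))
```

The point of returning the ID is that the caller did not supply one, so the return value is the only handle it has on the document it just stored.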

You can ask the collection to tell how many documents it contains with the count method:

irb(main):026:0> c.count
=> 1

As you might expect, you can store and retrieve data using any number of different languages. Although you are likely to work in a single language, MongoDB (like relational databases) doesn't care what language you use and lets you mix and match them freely.

In the above examples, I used Ruby to store data. I should be able to retrieve this data using Python, as follows:

>>> import pymongo
>>> from pymongo import Connection
>>> connection = Connection()
>>> db = connection.atf
>>> db.collection_names()
   [u'stuff', u'system.indexes']
>>> c = db.stuff

>>> c
   Collection(Database(Connection('localhost', 27017), u'atf'), u'stuff')

>>> c.find_one()
   {u'a': 1, u'_id': ObjectId('4b6fe8983c1c7d6a6a000001'), u'b': 2}

The only surprise here is probably that the strings all come back as Unicode, shown with the u'' syntax in Python 2.6 (which I am using here). Also, the document ID, under the key _id, is still there, but it is an object rather than a string.
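One reason the ID is an object rather than a string is that it has internal structure. A MongoDB ObjectId is 12 bytes long, and its first four bytes hold a creation time in seconds since the Unix epoch. The following sketch, which uses only Python's standard library rather than the driver (and is written for Python 3, unlike the session above), pulls that timestamp out of the ID returned earlier:

```python
from datetime import datetime, timezone

# The _id value from the earlier examples, as 24 hex characters.
object_id_hex = "4b6fe8983c1c7d6a6a000001"

raw = bytes.fromhex(object_id_hex)        # 12 raw bytes
seconds = int.from_bytes(raw[:4], "big")  # leading 4 bytes, big-endian
created = datetime.fromtimestamp(seconds, tz=timezone.utc)

print(len(raw))
print(created.isoformat())
```

For this ID, the decoded time falls in February 2010, which is when the document was inserted.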

You also can see that the MongoDB developers have gone to great efforts to keep the APIs similar across different languages. This means if you work in more than one language, you likely will be able to depend on similar (or identical) method names to perform the same task.
