At the Forge - CouchDB

by Reuven M. Lerner

The surge in interest in non-relational databases—often known collectively as NoSQL—is now impossible for Web developers to ignore. Whether you are looking at a non-relational database for reasons of scalability, availability, cost, performance or just because it's a shiny new toy, there's no denying that serious Web developers need to consider non-relational options when designing an application. In the past few months, every project on which I've worked has at least considered a non-relational solution, even when the final decision was to use a relational database.

In the previous two installments of this column, I looked at MongoDB, an object (or “document”) database with a somewhat relational feel. MongoDB stores objects, but its query language should look somewhat familiar to those of you who have long used relational databases. If you're willing to consider a more radical departure from the world of relational databases and query syntax, instead of using the map-reduce paradigm, easy replication and a straightforward RESTful API, you might want to consider CouchDB, now part of the Apache Software Foundation. Even if you don't use CouchDB in production environments, you may find (as I do) that its use of JavaScript, coupled with its implementation of map-reduce, helps you think in new and different ways about old problems.

CouchDB Basics

Downloading and installing CouchDB is extremely easy. If it's not available via a simple apt-get install (or the yum equivalent) for your system, or if you simply prefer to install a source version, you can download it from the CouchDB home page at couchdb.apache.org. The version I'm running is slightly out of date (0.10), rather than the latest production version at the time of this writing (0.11). Nevertheless, the differences aren't that great, especially for the simple examples shown here.

After I installed CouchDB with apt-get, I started it with the following standard command on my server:

/etc/init.d/couchdb start

That starts the CouchDB server on port 5984. By default, this means I can access the CouchDB server as:

http://127.0.0.1:5984/

If you are interested in accessing your CouchDB server from another system, you can modify the CouchDB configuration file (/etc/couchdb/default.ini on my machine) by going to the “httpd” section and replacing the name-value pair:

bind_address = 127.0.0.1

with your IP address instead of 127.0.0.1 (that is, localhost). Restart CouchDB, and it will be accessible not only to local HTTP clients, but also across the Internet.

Obviously, starting a CouchDB server on its well-known port and without any security restrictions is asking for trouble. If you are running a production instance of CouchDB, you should ensure that it cannot be accessed or modified by the general public. CouchDB comes with basic authentication options that make it possible to restrict access to databases, and you should look into those before deploying your system to a public server.

If you point your Web browser to your CouchDB server at port 5984, you will see the following:

{"couchdb":"Welcome","version":"0.10.0"}

This response tells you several things. First, you see that all communication in CouchDB takes place using JSON, the JavaScript object notation that has become a lightweight method for communication among Internet applications. Although CouchDB is written in Erlang, an open-source language designed for distributed processing, nearly everything associated with it uses JavaScript. Functions (as you soon will see) are written in JavaScript, and both inputs and outputs are sent using JSON.

CouchDB is also RESTful—it uses the entire vocabulary of HTTP verbs to describe what should happen and a URL to indicate the object on which the action should take place. Most people are familiar with HTTP's GET and POST verbs, but less so with PUT and DELETE. CouchDB uses all of these, combining HTTP, JSON and REST for rich effect.

Thus, when you point your Web browser to your CouchDB server on port 5984, asking for the document /, you actually are issuing a GET request for the document named /. CouchDB's response describes the server, rather than an individual document. The response is an object (equivalent to a “hash” or “dictionary” in languages such as Perl, Ruby or Python) with two keys. The first, “couchdb”, simply says “Welcome”. The second, named “version”, tells you the version of the server that is running—in this case, 0.10.0.

Let's change the URL somewhat, going instead to the URL /_utils. If you go to that document, you'll see a much more interesting response. Indeed, rather than receiving JSON, you will get a full-fledged Web page, with a CouchDB logo in the top right. This is Futon, the CouchDB Web-based interface. It is sometimes called the administrative interface, but it is also quite useful for experimenting with the database.

Along the right side of the main Futon page is the main “tools” menu. It normally comes up in the overview mode, but you can switch to a number of other screens by clicking on them. Most interesting to me is the test suite, which provides a Web-based interface to ensure that your CouchDB installation is working correctly. Although it is unlikely that your system has any problems, you still might want to run the test suite, just for personal satisfaction and thoroughness.

Creating and Populating a Database

Going back to the overview screen, you should see a prompt at the top saying create database. Just as with most relational database systems, a single server may contain more than one database. Each database then may contain any number of documents, each of which has a unique ID and any number of name-value pairs.

So to get started, you need to create a new database. Click on the link, and an AJAX dialog box opens up, asking for the name of the database. I'm going to assume a database name of “atf” for this column, although you might want to choose something closer to your own name or interests. You may use any alphanumeric characters (plus some symbols) for a database name, keeping in mind that a leading underscore is used by internal CouchDB systems, meaning that you should avoid such names for your own work.

After you create a database, you'll be brought to the browse database page. Click on the new document button to create a new document. CouchDB automatically gives the new document a unique ID value (key name “_id”). You may change the ID to one of your liking, if you have a unique numbering or naming scheme that you prefer.

Then, you may add as many name-value pairs as you like, by clicking on the add field button. The name is assumed to be a string, but the value may be any legitimate JSON value—a number, string, array or object. If you enter an array (within square brackets) into the interactive Futon interface, upon completion, it will be represented visually as an array. The same is true with a JSON object. After you enter it, the name-value pairs are displayed in an easy-to-read format.

Once you have added some fields to your document, click the save button.

I added a number of fields to a document describing me. The fields tab in Futon shows me these values in a nice, easy-to-edit format. If I want to see the document in its native JSON, I can click on the source tab and see it there:

{
   "_id": "0534ca63b70beb02d24b62ec4fe72566",
   "_rev": "4-bea8364f4536833c1fd7de5781ea8a08",
   "first_name": "Reuven",
   "last_name": "Lerner",
   "children": [
       "Atara",
       "Shikma",
       "Amotz"
   ]
}

Notice that in addition to the fields I already have mentioned, there is a “_rev” field. That's because when you save a document, the old version does not disappear. Rather, CouchDB keeps the old one around, much as a garbage collector handles memory in high-level languages, such as Ruby and Python. This means there can be multiple documents with the same “_id” field, although only one is considered current—the one with the latest “_rev” field value. The revision contains an integer as well as an MD5 hash value. You normally can look at only the integer to identify the revision, ignoring the hex portion of the string.

Do not mistake the revision tag as a means of keeping backups or for version control. The moment someone compacts a database, all of the old revisions are removed.

As with other non-relational databases, CouchDB allows you to add, remove and rename fields whenever you like. Each document in a database might have its own unique field names, although in practice, this is fairly rare. It is far more common for each document to have a common set of fields, perhaps with some variation in special cases. It is common to say that CouchDB is “schemaless”, but I think it's safer to say that CouchDB (and other NoSQL storage facilities) allows the programmer to decide on the schema at runtime, rather than in advance—much as a dynamic programming language allows you to determine the type of a variable at runtime, rather than at compile time.

One thing that obviously is missing from a JSON-based database is the notion of a foreign key—a pointer from one document, or record, to another. There is no built-in facility for linking one document to another, although there certainly are ways to use information in one document to view another document.

Outside Futon

It's very nice that CouchDB comes with an easy-to-use, browser-based interface. However, this interface is clearly not what you want to be using from your applications. As I wrote above, CouchDB communicates with the outside world using JSON over HTTP. Any action that you just performed via the browser also should be possible via an HTTP client. You could use a library for a programming language; every major language has at least one CouchDB client. But a popular and easy-to-use option is the curl command-line program.

To send a simple GET request to my CouchDB server, I can write:

curl http://atf.lerner.co.il:5984/

And sure enough, I receive the same response as before:

{"couchdb":"Welcome","version":"0.10.0"}

Unfortunately, if something goes wrong, curl won't say much. For that reason, I generally prefer to use the -v option to curl (and most other programs, for that matter), which shows me the HTTP request and response as they take place. It also comes in handy to specify the HTTP verb you want to use (GET, in this case), so I'll do that with the -X option. Thus, I can write:

~$ curl -vX GET http://atf.lerner.co.il:5984/

And I see:


* About to connect() to atf.lerner.co.il port 5984 (#0)
*   Trying 69.55.225.93... connected
* Connected to atf.lerner.co.il (69.55.225.93) port 5984 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.19.4 (universal-apple-darwin10.0) libcurl/7.19.4
> OpenSSL/0.9.8l zlib/1.2.3
> Host: atf.lerner.co.il:5984
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: CouchDB/0.10.0 (Erlang OTP/R13B)
< Date: Mon, 12 Apr 2010 12:03:05 GMT
< Content-Type: text/plain;charset=utf-8
< Content-Length: 41
< Cache-Control: must-revalidate
<
{"couchdb":"Welcome","version":"0.10.0"}
* Connection #0 to host atf.lerner.co.il left intact
* Closing connection #0

You might notice that the “Content-type” response header indicates that what the server sends back is in text/plain format. So, although you might see the content as JSON, CouchDB itself indicates that it's sending plain text. This isn't a big deal, unless you are writing a program that specifically waits for JSON, so you might need to modify its expectations a bit.

You can request your Futon URL as well, using HEAD to avoid the long response:


~$ curl -vX HEAD http://atf.lerner.co.il:5984/_utils/

* About to connect() to atf.lerner.co.il port 5984 (#0)
*   Trying 69.55.225.93... connected
* Connected to atf.lerner.co.il (69.55.225.93) port 5984 (#0)
> HEAD /_utils/ HTTP/1.1
> User-Agent: curl/7.19.4 (universal-apple-darwin10.0) libcurl/7.19.4
> OpenSSL/0.9.8l zlib/1.2.3
> Host: atf.lerner.co.il:5984
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: CouchDB/0.10.0 (Erlang OTP/R13B)
< last-modified: Fri, 23 Oct 2009 12:40:09 GMT
< Date: Mon, 12 Apr 2010 12:04:43 GMT
< Content-Type: text/html
< Content-Length: 3158

In this case, you get a text/HTML response. And, of course, you know that Futon sends HTML for its response, because you already have been using it from a Web browser.

Now, let's try to look at the atf database, which I created earlier, that contains a single document (that is, record). How can I retrieve that information?

Well, I can start by asking for the database (leaving off the -v option now for space reasons):

~$ curl -X GET http://atf.lerner.co.il:5984/atf

{"db_name":"atf","doc_count":1,"doc_del_count":0,"update_seq":4,
 "purge_seq":0,"compact_running":false,"disk_size":16473,
 "instance_start_time":"1271067859057749","disk_format_version":4}

In other words, asking for a database gives basic information about that database, from the number of documents to the amount of space it consumes on the disk.

You can retrieve an individual document by using its ID:

~$ curl -X GET
↪http://atf.lerner.co.il:5984/atf/0534ca63b70beb02d24b62ec4fe72566

{"_id":"0534ca63b70beb02d24b62ec4fe72566",
 "_rev":"4-bea8364f4536833c1fd7de5781ea8a08",
 "first_name":"Reuven",
 "last_name":"Lerner",
 "children":["Atara","Shikma","Amotz"]}

If I want to modify one or more fields in this document, or even add another field, I can do so with a PUT command. curl's -d option lets me specify a document on the command line:

~$ curl -X PUT
↪http://atf.lerner.co.il:5984/atf/0534ca63b70beb02d24b62ec4fe72566
   -d '{"first_name": "Superman", "middle_initial": "M."  }'

{"error":"conflict","reason":"Document update conflict."}

Well, this is surprising. CouchDB is complaining that it cannot perform the update I need, because there is a conflict. Notice that it does not report the error using HTTP codes (such as 500), but rather by sending a JSON object back to me, containing the “error” key.

The reason CouchDB gives an error message here is that I haven't indicated which revision I am attempting to update. Without such a revision indicator, CouchDB assumes I have stale data and, thus, will not allow me to update the document. Only if I send my update with the current “_rev” value will the update succeed. For example:

~$ curl -X PUT
↪http://atf.lerner.co.il:5984/atf/0534ca63b70beb02d24b62ec4fe72566
   -d '{"_rev": "4-bea8364f4536833c1fd7de5781ea8a08",
        "first_name": "Superman", "middle_initial": "M."  }'

CouchDB responds with:

{"ok":true,"id":"0534ca63b70beb02d24b62ec4fe72566","rev":
↪"5-fe6fccb89b9512d26120fbd63dbb15c4"}

In other words, the update succeeded, incrementing the revision. If you try the same update again, you will get the same “update conflict” error message as before, because there can be only one update to a given revision.

Note that when you PUT an update to a document, you must update the entire document at once. Unlike the UPDATE command in a relational database, adding a new revision to a CouchDB document does not modify individual fields. Rather, it stores an entirely new document with the same ID and an incremented revision number. This means in this example, it's true that I have added the “middle_initial” field successfully. However, I also have effectively removed the “children” field, because I did not specify it in my PUT statement.

You can add an entirely new document to your database using the POST verb in HTTP. For example:

~$ curl -X POST http://atf.lerner.co.il:5984/atf
  -d '{"first_name" : "Atara", "last_name" : "Lerner-Friedman"}'

Sure enough, I get the following response, indicating that a new document was created:

{"ok":true,"id":"aeb6925eb23278f1b8e530ba67b0172d",
 "rev":"1-f0e336978a368f679ee7b280107bc2fb"}

I should add that I had a terrible time trying to use curl to create a document, all because of the quotes. It seems that you must use double quotes inside a JSON request (around the names of the keys and values). Single quotes result in a strange error message indicating that the UTF-8 encoding for JSON is invalid, which did not quite point me in the right direction.

Conclusion

CouchDB is an increasingly popular non-relational database, offering a great deal of flexibility in storage and retrieval. This month, I explained how to create databases in CouchDB and do basic storage and retrieval using both the Web-based Futon interface and curl. Next month, I will demonstrate writing JavaScript functions that process and display the data, demonstrating the true power of CouchDB.

Resources

The home page for CouchDB is at the Apache Project (couchdb.apache.org). There, you can not only download the software, but also read documentation, from tutorials to an active wiki. The CouchDB Web site also has links to drivers for the various languages you're likely to use when working with CouchDB.

If you're interested in the JSON format used by CouchDB, you can learn more about it at the main Web site: json.org.

Finally, two good books on CouchDB were released in the past few months. Beginning CouchDB by Joe Lennon and published by Apress is aimed more at beginners, but it has a solid introduction to CouchDB, Futon and how you might use the system. CouchDB: The Definitive Guide by J. Chris Anderson, Jan Lehnardt and Noah Slater, published by O'Reilly, is a bit more advanced and meaty, but it might not be appropriate for beginning users of non-relational databases.

Reuven M. Lerner is a longtime Web developer, architect and trainer. He is a PhD candidate in learning sciences at Northwestern University, researching the design and analysis of collaborative on-line communities. Reuven lives with his wife and three children in Modi'in, Israel.

Load Disqus comments