At the Forge - CouchDB Views

by Reuven M. Lerner
CouchDB

Last month's column was an initial look at CouchDB, a non-relational, open-source database server, now sponsored by the Apache Software Foundation. CouchDB uses many Web-related standards: data is stored in JSON format, communication takes place using JSON and RESTful resources, and functions are written in JavaScript. CouchDB is not as speedy as some of the other non-relational (NoSQL) databases, such as MongoDB and Cassandra. However, CouchDB is designed to be dependable and easily replicated across multiple servers—a far cry from relational databases, for which replication remains slightly annoying at best.

Last month, I explained how once you have created a CouchDB database, you can use the curl utility to insert, update and remove documents. Each “document” is nothing more than a JSON object, which means it's basically a hash (a Python dictionary), which then may contain arbitrarily nested levels of arrays, hashes and scalar values (that is, strings and numbers). So, I can create a database with an HTTP PUT request:

curl -X PUT http://localhost:5984/atf

Then, I can add some documents to that database with HTTP POST requests:

curl -X POST http://localhost:5984/atf
     -d '{"first_name" : "Atara", "middle_name": "Margalit", 
          "sex":"f", "last_name" : "Lerner-Friedman", 
          "birthday" : "2000-dec-16"}'

curl -X POST http://localhost:5984/atf
     -d '{"first_name" : "Shikma", "middle_name": "Bruria", 
          "sex":"f", "last_name" : "Lerner-Friedman", 
          "birthday" : "2002-dec-17"}'

curl -X POST http://localhost:5984/atf
     -d '{"first_name" : "Amotz", "middle_name": "David", 
          "sex":"m", "last_name" : "Lerner-Friedman", 
          "birthday" : "2005-oct-31"}'

Then, I can check to see that there are three documents, by using an HTTP GET request on the database:

bash-3.2# curl -X GET http://localhost:5984/atf

   {"db_name":"rmltest","doc_count":3,
    "doc_del_count":0,"update_seq":3,"purge_seq":0,
    "compact_running":false,"disk_size":12377,
    "instance_start_time":"1273430793169153","disk_format_version":4}

As you can see, the "doc_count" attribute shows that there are, indeed, three documents in this database.

Now, if you have only three documents, querying them doesn't make much sense. But, if you have 300, or even 300,000 documents, you certainly are not going to want to iterate over them just to determine which is the best match and/or most appropriate.

If you were using a relational database server, you would use SQL to retrieve the rows that match a particular set of criteria. Even MongoDB, which I covered earlier this year, offers a query language that is vaguely SQL-like. CouchDB, however, offers a completely different query system, based on JavaScript functions and the map-reduce paradigm. CouchDB's syntax takes some getting used to, especially if you are relatively new at writing JavaScript functions. However, a few small functions can give you a great deal of power, which is (perhaps) the secret behind CouchDB's success.

CouchDB Views

I've already explained that CouchDB refers to each stored data item as a “document”. CouchDB defined a special kind of document, known as a “design document”, which contains a “view”—JavaScript code that is executed when you want to perform the query. (Design documents also may contain “show” functions, which can sort or otherwise modify the way in which data is displayed, but I won't discuss “show” functions in this column.) If you are developing only a database or applications, you might want to avoid the overhead of a permanent view by creating a temporary view instead. Temporary views take less time to set up and are a bit more flexible, but they execute much more slowly.

So, let's begin by creating a temporary view and some basic JavaScript views. For the data, I'm using the information I entered above, about my three children. (Feel free to substitute information about your own children, if you prefer.) I find it easiest to create temporary views using Futon, the Web-based administrative and maintenance tool that comes with CouchDB. Simply point your Web browser at the server on which CouchDB is running, on port 5984, and go to your database of choice. Then, select temporary view from the pull-down menu in the top right-hand corner.

Your screen now should consist of two parts: on the left side, you have a simple JavaScript function, under the header “map”:

function(doc) {
  emit(null, doc);
}

If you ever have used “map” in a language such as Ruby, Python, JavaScript or Lisp, this function already might make sense to you: your function is invoked repeatedly for a list of documents. If it produces a key-value pair, that pair is added to the output from the function running across all documents.

For instance, the example function (which is anonymous) takes a document as an argument, and returns a null key and the document itself as the value. If you click “Run” under the code, you get a set of results: three “null” keys on the left side and the original documents (with their mandatory _id field) on the right.

You can, of course, modify the function such that it outputs only information about girls. To do that, write:

function(doc)
{
    if (doc['sex'] == 'f')
    {
        emit(null, doc);
    }
}

Notice how by using a simple if statement, you can eliminate unwanted rows. Now, what if you're interested in getting all the documents, but sorted. CouchDB orders the results by their keys, which means the key you use is useful not only for identifying the resulting documents, but also for ordering them. You could, for example, sort the results by first name:

function(doc) {
        emit(doc.first_name, doc);
}

In my case, this means I first get the record for my son (Amotz), followed by Atara, followed by Shikma. Sorting by last name in this particular case doesn't help very much, because they all have identical last names. But, keys can be any data type, which means you even can use an array to arrange items, by last and then first name:

function(doc) {
        emit([doc.last_name, doc.first_name], doc);
}

You also can sort them by birthday:

function(doc) {
    emit(doc.birthday, doc)
}

However, this will not necessarily have the effect you want. The “birthday” field is a text string, which means the sorting will be done as a string, rather than as a date. (In the case of my children's birthdays, the sorting happens to work out fine, but this is a happy accident, not inherent to CouchDB.)

If you want to create a permanent view, there are a few ways to do so. You can use the Futon (Web-based) interface, and any temporary view can be turned into a permanent one by clicking on save as from the temporary view's screen. But another way, and one that's a bit more flexible if you're writing complex code, is to use curl to PUT a new design document on the server. This document contains JSON, like all other documents in CouchDB, but it has a number of fields that are treated specially by CouchDB. Here is my file, which I called simpleview.json:

{
    "_id" : "_design/example",
    "views": {
        "show_by_birthday": {
            "map" : "function(doc){ emit(doc.birthday, doc) }"
        }
    }
}

Then, I uploaded the contents of this file using curl, as follows:

curl -X PUT http://localhost:5984/atf/_design/simpleview 
 ↪-d @simpleview.json

By using the -d flag and the @ sign, I was able to tell curl to upload the JSON from a file, rather than the command line. I uploaded it to a design document (as you can see from the "_design/" at the beginning of its name), with the view called simpleview. Once uploaded, I then could run it from Futon (by going to the menu item "show_by_birthday"), or by again using curl:

curl -X GET http://localhost:5984/atf/_design/simpleview/
↪_view/show_by_id

The results of the query are the same no matter what. Futon displays them in a nicer format, but it obviously would be easier for a program to work with the JSON output via HTTP. If you want to edit the view, you can either re-upload it via curl or use Futon to go to the view and edit the JavaScript function.

Reduce

So far, the examples here have focused on writing “map” functions, which take the list of documents and outputs a series of key-value pairs. But, as you might imagine, the map-reduce paradigm has two parts, the second of which is called reduce. The idea is that you use map to filter and transform the data into a list, and then use reduce to turn that list into something even more useful, typically a single value. For example, you can define the reduce function as follows:

function(keys, values, rereduce) {
    return 1;
}

From within Futon, this returns a list of three documents, the keys of which are the birthdays, and the values of which is the number 1. This isn't very interesting, to be honest, because you could have accomplished the same thing from within the “map” request. But, if you invoke the same query from curl, you get something else entirely:

{"rows":[
    {"key":null,"value":1}
]}

Why the difference? And, why is there only a single row now?

The answer is that reduce is designed to, well, reduce things. That means instead of returning a result for each row, it returns a single result from all rows. The reduce function actually can be invoked in two ways:

  • The usual way, in which the “keys” and “values” represent document keys and values. In such cases, the “rereduce” parameter is set to false.

  • For rereducing, the “rereduce” parameter is set to true, the keys are null, and the values represent values from a previous, partial run of the “reduce” function.

As the CouchDB Wiki states, this means the reduce function must be both commutative (that is, the order in which arguments are processed doesn't matter) and associative (that is, the order in which operations are executed doesn't matter). This often (but not always) means the reduce function ends up performing addition or multiplication, returning the result of executing the function on all documents. One example of a reduce function that I found calculates the standard deviation from mapped results, so you can find out how similar the documents are along at least one dimension.

Learning to use map-reduce in CouchDB can take some time. However, this paradigm has proven itself for a decade or so. For example, as the backbone for Google's search system, map-reduce has demonstrated excellent performance and flexibility. Granted, Google is using something (or some things) that are not CouchDB, but with a similar interface and paradigm.

Conclusion

If you like the idea of using JavaScript, MVCC, easy replication and JSON documents, CouchDB might be a good choice. It apparently is not as fast as some of its non-SQL competition, such as MongoDB and Cassandra. However, CouchDB's built-in (and sophisticated) Web interface, RESTful communication and the flexibility of map-reduce are all good reasons to use it. CouchDB is so easy to set up and use, why not try it out? Even if you end up not using it, CouchDB is a great way to learn about map-reduce and to try to create some small functions using it.

Resources

The home page for CouchDB is at the Apache Project (couchdb.apache.org). There, you not only can download the software, but also read documentation, from tutorials to an active wiki. The CouchDB Web site also has links to drivers for the various languages you're likely to use when working with CouchDB.

If you're interested in the JSON format used by CouchDB, you can learn more about it at the main Web site: json.org.

Finally, two good books on CouchDB were released in the past few months. Beginning CouchDB by Joe Lennon, published by Apress, is aimed more at beginners, but it has a solid introduction to CouchDB, Futon and how you might use the system. CouchDB: The Definitive Guide by J. Chris Anderson, Jan Lehnardt and Noah Slater, published by O'Reilly, is a bit more advanced and meaty, but it might not be appropriate for beginning users of non-relational databases.

For a list of interesting map-reduce functions for CouchDB, along with explanations of how they work, see this page on the CouchDB Wiki: wiki.apache.org/couchdb/View_Snippets.

Reuven M. Lerner is a longtime Web developer, architect and trainer. He is a PhD candidate in learning sciences at Northwestern University, researching the design and analysis of collaborative on-line communities. Reuven lives with his wife and three children in Modi'in, Israel.

Load Disqus comments