At the Forge - Cassandra
Cassandra provides a non-relational storage and retrieval mechanism (NoSQL database) that features tremendous scalability, speed and flexibility. The inclusion of super columns (and super-column families, which I didn't discuss here) gives you just enough flexibility to store a great deal of information about many users. So long as you never have to search on anything other than the primary key or join information from different users at the database level, Cassandra is a good choice.
That said, Cassandra is significantly harder to understand and administer than other non-relational databases. I think the investment of time and effort are worth it, but you shouldn't expect to be able to work with Cassandra as quickly and easily as with, say, CouchDB or MongoDB. The flip side of this issue is that administration allows you to fine-tune a number of aspects of Cassandra's networking and consistency until you reach a level with which you're comfortable.
Next month, I'll continue exploring and discussing Cassandra, looking at ways to connect multiple Cassandra boxes to a cluster—and what happens when you do so.
The Cassandra home page is at cassandra.apache.org. You might find references to another Cassandra page; it only recently “graduated” to become a full-fledged Apache project, rather than an “incubator” project; thus, some references will be out of date. This page contains download links, documentation, an actively maintained wiki and links to papers, tutorials and drivers in a number of languages.
Cassandra is based on Amazon's Dynamo, the original paper for which is useful in understanding some of the design decisions. You can read this paper at www.allthingsdistributed.com/2007/10/amazons_dynamo.html.
Two complementary video talks describing Cassandra, but aimed more at the network storage aspects (rather than the practical day-to-day usage) are at www.parleys.com/#sl=1&st=5&id=1866 and vimeo.com/5185526.
Finally, although I still find the Cassandra documentation to be a bit lacking, a growing number of blogs, tutorials and testimonials have made their way onto the Web. Three that I particularly enjoyed were Arin Sarkissian's “WTF is a SuperColumn? An Intro to the Cassandra Data Model” (arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model), Evan Weaver's “Up and Running with Cassandra” (blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra) and Dominic Williams' “HBase vs Cassandra: why we moved” (ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved).
Reuven M. Lerner is a longtime Web developer, architect and trainer. He is a PhD candidate in learning sciences at Northwestern University, researching the design and analysis of collaborative on-line communities. Reuven lives with his wife and three children in Modi'in, Israel.
Special Reports: DevOps
Have projects in development that need help? Have a great development operation in place that can ALWAYS be better? Regardless of where you are in your DevOps process, Linux Journal can help!
With deep focus on Collaborative Development, Continuous Testing and Release & Deployment, we offer here the DEFINITIVE DevOps for Dummies, a mobile Application Development Primer, advice & help from the experts, plus a host of other books, videos, podcasts and more. All free with a quick, one-time registration. Start browsing now...