At the Forge - Cassandra

Meet the non-relational database that scales to handle even Amazon- and Facebook-size loads.

You can download Cassandra from its home page and install it on your computer. Because Cassandra is written in Java, there is only one distribution binary, which should work on any computer with a current JVM.

On my computer running Ubuntu, I first installed the latest Java JDK with:

apt-get install openjdk-6-jdk

Following this, I could have downloaded the latest Cassandra version and installed it. But instead, I decided to use apt-get to retrieve the latest version and to ensure that I will receive updates in the future. In order to do this, I first needed to add the appropriate GPG keys to my keychain, as per the instructions on the Cassandra Wiki:

gpg --keyserver --recv-keys F758CE318D77295D
gpg --export --armor F758CE318D77295D | sudo apt-key add -

Following that, I added these two lines to /etc/apt/sources.list:

deb unstable main
deb-src unstable main

Next, I ran apt-get update to retrieve the latest version information for all packages, and then I ran apt-get install cassandra to install it on the server. About a minute later, Cassandra was installed and ready to run on my machine.

I started it up with:

/etc/init.d/cassandra start

Sure enough, a quick peek at ps showed me that Cassandra indeed was running.
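If you prefer a check that doesn't depend on reading ps output, you can test whether anything is accepting connections on the port Cassandra's client interface uses. The following is a minimal Python sketch, not part of Cassandra; the function name port_is_open is my own, and it assumes the server listens on localhost port 9160, the address used with the CLI later in this article:

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False

if __name__ == "__main__":
    # Assumption: Cassandra's client port is 9160 on this machine.
    if port_is_open("localhost", 9160):
        print("something is listening on localhost:9160")
    else:
        print("nothing reachable on localhost:9160")
```

A successful connection shows only that some process is listening on that port, not that it is a healthy Cassandra node, so treat it as a quick sanity check.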

Talking to Cassandra

There are numerous interfaces to Cassandra from a variety of programming languages. However, the easiest way to connect to Cassandra often is via its built-in command-line interface (CLI), which comes with the program. Simply enter cassandra-cli in your shell, and you'll see a prompt that looks like this:

Welcome to cassandra CLI.

Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.

Your first task should be to connect to your local Cassandra server:

cassandra> connect localhost/9160
Connected to: "Test Cluster" on localhost/9160

In case you forgot what was just printed, you can get the current cluster name with:

cassandra> show cluster name
Test Cluster

You also can get a list of keyspaces in this cluster:

cassandra> show keyspaces

The system keyspace, as you can imagine, is used for Cassandra system tasks. It can be fun and interesting to explore, but you don't want to mess with it unless you really know what you're doing.


What if you want to create a new keyspace? Well, that's where you'll need to go in and change the system's configuration and restart Cassandra. The configuration file you need to modify is called storage-conf.xml. After I installed Cassandra on my Ubuntu system, it was placed in /etc/cassandra/storage-conf.xml. (The filename always will be storage-conf.xml, but the location might differ on your machine, depending on how you installed it.) You can see the contents of this configuration file from the Cassandra CLI, with the command:

cassandra> show config file

However, this command shows only the contents of the file, not its location, so you might have to poke around a bit to find it.

To add a new keyspace to your Cassandra cluster, first you must think about what you want to store and then how you can represent that in Cassandra. As an example, let's store a list of users. You don't need to think beyond that right now; all you need to define is the name of your column family. Individual columns and values can and will be defined on the fly.

To do this, define a new keyspace and one new column family. Each column family is analogous to a table in a relational database; it contains zero or more columns. Each column, in turn, is a name-value pair. Thus, by defining your keyspace as follows, you're basically saying you want to store information about users:

<Keyspace Name="People">
<ColumnFamily Name="Users" CompareWith="BytesType"/>
</Keyspace>

Like a relational database, you'll be able to store many fields of information about these users. Unlike a relational database, you don't need to define them from the start. Also unlike a relational database, you'll be able to retrieve information about users only via the key you use for this column family. So, if you use e-mail addresses as keys into the “Users” column family, you'll need an address to do something; having the person's first and last name will not do you much good.
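To make the model concrete, here is a toy Python sketch that mimics the shape of the data with nested dictionaries: a column family maps keys to rows, and each row maps column names to values. It does not talk to Cassandra at all; the insert and get helpers, the e-mail addresses and the column names are all invented for illustration:

```python
# Toy in-memory model: column family -> row key -> {column name: value}.
# This only imitates the shape of the data; it is not a Cassandra client.
people = {"Users": {}}  # stands in for the "People" keyspace

def insert(column_family, key, column, value):
    """Store one column (a name-value pair) under the given row key."""
    people[column_family].setdefault(key, {})[column] = value

def get(column_family, key):
    """Fetch all columns for a row. The key is the only way in."""
    return people[column_family].get(key, {})

# Columns are defined on the fly, so rows need not share the same columns.
insert("Users", "alice@example.com", "first_name", "Alice")
insert("Users", "alice@example.com", "last_name", "Smith")
insert("Users", "bob@example.com", "first_name", "Bob")
insert("Users", "bob@example.com", "city", "Chicago")

print(get("Users", "alice@example.com"))
print(get("Users", "Alice"))  # a first name is not a key, so nothing comes back
```

Notice that the second lookup returns an empty result: knowing a user's first name gets you nowhere unless you also maintain some structure that is keyed on first names.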

Cassandra stores information as a set of bytes; there are no internal types. However, you can (and should) indicate to Cassandra how the data should be sorted. Specifying a “comparator” allows you to simulate the storage of different types. More important, it determines the order in which you will receive results. That's because there is no ORDER BY equivalent in Cassandra when you retrieve data; you need to decide on an order and specify it in the configuration file. Somewhat surprisingly, the ordering is done when the data is written, not when it is read. In the case of the example “Users” column family, which uses BytesType, you'll simply retrieve its columns in byte order.
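The effect of sorting at write time, and of choosing one comparator over another, can be illustrated with a small Python sketch. This is not how Cassandra is implemented; the SortedColumns class is hypothetical, and the numeric ordering is just a stand-in for any comparator other than plain byte order:

```python
import bisect

class SortedColumns:
    """Keep column names in sorted order as they are written."""

    def __init__(self, sort_key):
        self.sort_key = sort_key  # plays the role of the comparator
        self._keys = []           # sort keys, always kept in order
        self._names = []          # column names, parallel to _keys

    def insert(self, name):
        # Find the right slot at write time, so reads need no sorting.
        k = self.sort_key(name)
        i = bisect.bisect_left(self._keys, k)
        self._keys.insert(i, k)
        self._names.insert(i, name)

    def names(self):
        return list(self._names)

byte_order = SortedColumns(sort_key=lambda name: name)          # raw byte comparison
numeric_order = SortedColumns(sort_key=lambda name: int(name))  # compare as numbers

for name in [b"10", b"9", b"100"]:
    byte_order.insert(name)
    numeric_order.insert(name)

print(byte_order.names())     # [b'10', b'100', b'9']
print(numeric_order.names())  # [b'9', b'10', b'100']
```

The same three names come back in different orders depending on the comparator, which is why the choice has to be made up front rather than at query time.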

If you put the above <Keyspace> section inside the <Keyspaces> tag in your storage-conf.xml file and restart Cassandra, you'll find that it fails to start up. (The error logs are in /var/log/cassandra, at least in my Ubuntu installation.) That's because there are three other definitions you need to include: ReplicaPlacementStrategy, ReplicationFactor and EndPointSnitch. None of these definitions will concern you when you have a single Cassandra node, so I suggest simply copying them from the included Keyspace1 keyspace. In the end, this part of your keyspace definition will look like this:

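<Keyspace Name="People">
<ColumnFamily Name="Users" CompareWith="BytesType"/>
<ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
<ReplicationFactor>1</ReplicationFactor>
<EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
</Keyspace>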



Tyler Hobbs

That was a great informational piece! You make a lot of useful observations that are hard for an insider to notice any more.

I do have a couple of corrections. You say that "All nodes eventually contain all data." This is generally not the case. You set a replication factor (RF) per keyspace which determines how many nodes store a copy of each row (a set of data associated with a key). If RF is less than the number of nodes in your cluster, every node will contain different (but overlapping) sets of data.

Second, although it is true that in Cassandra 0.6 you must restart a node to create a new Column Family or Keyspace, it is no longer true for 0.7 (released yesterday). Keyspace and Column Families may be created, altered, or dropped on a live cluster.