At the Forge - Cassandra
The home page for Cassandra is cassandra.apache.org. From there, you can download Cassandra and install it on your computer. Because Cassandra is written in Java, there is only one distribution binary, which should work on any computer with a current JVM.
On my computer running Ubuntu, I first installed the latest Java JDK with:
apt-get install openjdk-6-jdk
At this point, I could have downloaded the latest Cassandra release and installed it by hand. Instead, I decided to use apt-get, both to retrieve the latest version and to ensure that I would receive updates in the future. To do this, I first needed to add the appropriate GPG keys to my keychain, as per the instructions on the Cassandra Wiki:
gpg --keyserver wwwkeys.eu.pgp.net --recv-keys F758CE318D77295D
gpg --export --armor F758CE318D77295D | sudo apt-key add -
Following that, I added these two lines to /etc/apt/sources.list:
deb http://www.apache.org/dist/cassandra/debian unstable main
deb-src http://www.apache.org/dist/cassandra/debian unstable main
Next, I ran apt-get update to retrieve the latest version information for all packages, and then I ran apt-get install cassandra to install it on the server. About a minute later, Cassandra was installed and ready to run on my machine.
I started it up with:
/etc/init.d/cassandra start
Sure enough, a quick peek at ps showed me that Cassandra indeed was running.
There are numerous interfaces to Cassandra from a variety of programming languages. However, the easiest way to connect to Cassandra often is via its built-in command-line interface (CLI), which comes with the program. Simply enter cassandra-cli in your shell, and you'll see a prompt that looks like this:
Welcome to cassandra CLI.
Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
cassandra>
Your first task should be to connect to your local Cassandra server:
cassandra> connect localhost/9160
Connected to: "Test Cluster" on localhost/9160
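If the connect command fails, it can help to check whether anything is accepting TCP connections on that port at all. The following is a minimal sketch in Python, not part of Cassandra. The function name port_is_open is made up for illustration, and the host and port are simply the ones used in the CLI session above.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # localhost and 9160 are the values from the CLI session above.
    if port_is_open("localhost", 9160):
        print("Something is listening on localhost:9160")
    else:
        print("Nothing is accepting connections on localhost:9160")
```

A successful check shows only that some process is listening on the port, not that the process is Cassandra or that it is healthy.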
In case you forgot what was just printed, you can get the current cluster name with:
cassandra> show cluster name
Test Cluster
You also can get a list of keyspaces in this cluster:
cassandra> show keyspaces
Keyspace1
system
The system keyspace, as you can imagine, is used for Cassandra system tasks. It can be fun and interesting to explore, but you don't want to mess with it unless you really know what you're doing.
What if you want to create a new keyspace? For that, you need to change Cassandra's configuration and restart it. The configuration file you need to modify is called storage-conf.xml. After I installed Cassandra on my Ubuntu system, the file was in /etc/cassandra/storage-conf.xml. (The filename always will be storage-conf.xml, but the location might differ on your machine, depending on how you installed Cassandra.) You can see the contents of this configuration file from the Cassandra CLI, with the command:
cassandra> show config file
However, this command shows only the contents of the file, not its location, so you might have to poke around a bit to find it.
To add a new keyspace to your Cassandra cluster, first you must think about what you want to store and then how you can represent that in Cassandra. As an example, let's store a list of users. You don't need to think beyond that right now; all you need to define is the name of your column family. Individual columns and values can and will be defined on the fly.
To do this, define a new keyspace and one new column family. Each column family is analogous to a table in a relational database; it contains zero or more columns. Each column, in turn, is a name-value pair. Thus, by defining your keyspace as follows, you're basically saying you want to store information about users:
<Keyspace Name="People">
  <ColumnFamily Name="Users" CompareWith="BytesType"/>
</Keyspace>
</Keyspaces>
Like a relational database, you'll be able to store many fields of information about these users. Unlike a relational database, you don't need to define them from the start. Also unlike a relational database, you'll be able to retrieve information about users only via the key you use for this column family. So, if you use e-mail addresses as keys into the “Users” column family, you'll need an address to do something; having the person's first and last name will not do you much good.
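To make the shape of this data model concrete, here is a toy in-memory sketch in Python. It is not Cassandra's API. The helper names insert and get_row, the e-mail addresses and the column names are all invented for illustration. It models only the nesting described above: a column family holds rows, each row is found by its key, and each row holds name-value pairs that are created on the fly.

```python
# Toy model of the layout described above (not Cassandra's API):
# column family -> row key -> {column name: value}
keyspace = {"Users": {}}  # stands in for the "People" keyspace


def insert(column_family, key, column, value):
    """Store one name-value pair under a row key, creating the row if needed."""
    keyspace[column_family].setdefault(key, {})[column] = value


def get_row(column_family, key):
    """Look up a row by its key, the only access path this model offers."""
    return keyspace[column_family].get(key, {})


# Columns are created on the fly; rows need not share the same columns.
insert("Users", "alice@example.com", "first_name", "Alice")
insert("Users", "alice@example.com", "last_name", "Smith")
insert("Users", "bob@example.com", "first_name", "Bob")

print(get_row("Users", "alice@example.com"))
print(get_row("Users", "nobody@example.com"))
```

In this model, finding Alice by her first name would mean scanning every row, which is why the choice of key matters so much.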
Cassandra stores information as a set of bytes; there are no internal types. However, you can (and should) indicate to Cassandra how the data should be sorted. Specifying a “comparator” allows you to simulate the storage of different types. More important, it determines the order in which you will receive results. That's because there is no ORDER BY equivalent in Cassandra when you retrieve data; you need to decide on an order and specify it in the configuration file. Somewhat surprisingly, the ordering is done when the data is written, not when it is read. In the case of the example “Users” column family, you'll just retrieve them in byte order.
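The idea of sorting at write time can be sketched in a few lines of Python. This is only an illustration of the principle, not Cassandra's implementation, and it assumes the comparator orders column names within a row. The Row class and the column names are hypothetical.

```python
import bisect


class Row:
    """Toy row that keeps its column names ordered by raw byte value."""

    def __init__(self):
        self._names = []    # column names, always kept sorted
        self._values = {}   # column name -> value

    def write(self, name, value):
        # The sorting cost is paid here, when the column is written.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def read(self):
        # Reading just walks the already-sorted names; there is no sort step.
        return [(name, self._values[name]) for name in self._names]


row = Row()
for name in (b"last_name", b"Email", b"first_name", b"10", b"9"):
    row.write(name, b"...")

print([name for name, _ in row.read()])
# prints [b'10', b'9', b'Email', b'first_name', b'last_name']
```

Note that under byte ordering, "10" sorts before "9" and uppercase letters sort before lowercase ones. That is the kind of result a different comparator would change.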
If you put the above <Keyspace> section inside the <Keyspaces> tag in your storage-conf.xml file and restart Cassandra, you'll find that it fails to start up. (The error logs are in /var/log/cassandra, at least in my Ubuntu installation.) That's because there are three other definitions you need to include: ReplicaPlacementStrategy, ReplicationFactor and EndPointSnitch. None of these definitions will concern you when you have a single Cassandra node, so I suggest simply copying them from the included Keyspace1 keyspace. In the end, this part of your keyspace definition will look like this:
<Keyspace Name="People">
  <ColumnFamily Name="Users" CompareWith="BytesType"/>
  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
  <ReplicationFactor>1</ReplicationFactor>
  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
</Keyspace>
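Because a missing element keeps Cassandra from starting, it can be convenient to check a keyspace definition before restarting. The following Python sketch is not a tool that ships with Cassandra. The function name missing_settings is made up, and it checks only for the three elements named above, using the standard library's XML parser.

```python
import xml.etree.ElementTree as ET

# The three settings the text above says every keyspace definition needs.
REQUIRED = ("ReplicaPlacementStrategy", "ReplicationFactor", "EndPointSnitch")


def missing_settings(keyspace_xml):
    """Return the required child elements absent from a <Keyspace> fragment."""
    keyspace = ET.fromstring(keyspace_xml)
    return [tag for tag in REQUIRED if keyspace.find(tag) is None]


incomplete = """
<Keyspace Name="People">
  <ColumnFamily Name="Users" CompareWith="BytesType"/>
</Keyspace>
"""

complete = """
<Keyspace Name="People">
  <ColumnFamily Name="Users" CompareWith="BytesType"/>
  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
  <ReplicationFactor>1</ReplicationFactor>
  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
</Keyspace>
"""

print(missing_settings(incomplete))  # lists all three required elements
print(missing_settings(complete))    # prints []
```

The check looks only at whether the elements are present. It does not validate their values or anything else in storage-conf.xml.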
Comments
Corrections
Reuven,
That was a great informational piece! You make a lot of useful observations that are hard for an insider to notice any more.
I do have a couple of corrections. You say that "All nodes eventually contain all data." This is generally not the case. You set a replication factor (RF) per keyspace which determines how many nodes store a copy of each row (a set of data associated with a key). If RF is less than the number of nodes in your cluster, every node will contain different (but overlapping) sets of data.
Second, although it is true that in Cassandra 0.6 you must restart a node to create a new Column Family or Keyspace, it is no longer true for 0.7 (released yesterday). Keyspace and Column Families may be created, altered, or dropped on a live cluster.