SQL vs. NoSQL
The articles on NoSQL databases in Reuven M. Lerner's At the Forge column appearing in recent issues of LJ have been enjoyable. Because this is the Enterprise issue, I think it would be helpful to take a step back and look at the Linux database landscape and examine in particular the ongoing “battle” between SQL and NoSQL databases. By way of disclosure, I work for Monty Program, a company whose primary product is MariaDB, a community-enhanced branch of MySQL. That being said, I approached this topic with as open a mind as possible.
The rivalry between SQL and NoSQL has been building during the past year to the point where some people are predicting the end of the SQL era. Actually, the two camps are largely complementary, because they're designed to solve different problems.
Whenever the topic of databases arises, an alphabet soup is thrown around that would make NASA proud. Some of the acronyms I use a lot in this article include:
RDBMS: Relational Database Management System.
SQL: Structured Query Language, also used to refer to databases that use SQL as their query language.
NoSQL: used to refer to a class of databases that are non-relational and do not use SQL as their query language. They could perhaps be better called Distributed Database Management Systems (or DDBMSes), but for now, the popular term is NoSQL.
ACID: Atomicity, Consistency, Isolation, Durability (see the What Is ACID? sidebar).
CAP: Consistency, Availability, Partition tolerance (see the What Is CAP? sidebar).
So, what is the big deal about NoSQL databases? For one, they've introduced new ways (or perhaps re-introduced old ways) of thinking about what databases are and what they can do. For another, they're shiny and new, and all the cool kids seem to be using them. You could argue that Google's BigTable is the database that inspired the NoSQL movement. Or, maybe it was Amazon's Dynamo. Both of them are closed source, but they were (or are) impressive enough to inspire open-source interpretations.
The current NoSQL field includes HBase, Cassandra, Redis, MongoDB, Voldemort, CouchDB, Dynomite, Hypertable and several others. Some have followed the model of BigTable, others follow Dynamo's model, some are a mix of the two, and others are charting their own path. Some of these projects are more mature than others, but each of them is trying to solve similar problems.
Instead of having tables with columns and rows like you would find in a traditional RDBMS, most NoSQL databases are simple “key-value stores”. Each piece of data that goes into the database is given a key, and when you want the data back, you use the key to get it. This simplicity is beneficial, because it helps busy sites achieve extremely low latency, even under high load, when paired with a large number of servers and a fast network. The key-value model also makes development easier, because there is no schema to design and no query language to learn.
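To make the key-value idea concrete, here is a minimal sketch in Python. It is a toy, in-memory stand-in and not the API of any particular NoSQL product; the class name, method names and keys are all made up for illustration.

```python
class KeyValueStore:
    """Toy in-memory key-value store (hypothetical, for illustration only)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Store the value under the key, replacing any previous value.
        self._data[key] = value

    def get(self, key, default=None):
        # Look the value up by key; no query language is involved.
        return self._data.get(key, default)

    def delete(self, key):
        # Remove the key if present; a missing key is not an error.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("session:42", {"user": "reuven", "cart_items": 3})
session = store.get("session:42")
missing = store.get("session:99")
```

The store never looks inside the value. Everything it can do is expressed in terms of the key, which is why this model is so easy to spread across many servers.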
A step beyond simply having keys and values are the so-called document databases. A document, in this case, is a collection of various fields of information. Each individual document can have a different number of fields of varying lengths. These databases are useful if you have a lot of semi-structured data, and they are a good fit for object-oriented programming models (for example, you can consider the database as a storage area for objects).
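The following Python sketch shows roughly what that looks like. It assumes documents are stored as JSON text, which is a common choice but not something every document database does; the function names and sample records are hypothetical.

```python
import json

# Toy document store: document ID -> JSON text (hypothetical layout).
documents = {}


def save(doc_id, document):
    # Serialize the whole document; no schema is enforced.
    documents[doc_id] = json.dumps(document)


def load(doc_id):
    return json.loads(documents[doc_id])


def find(field, value):
    # Return the IDs of documents whose field equals value.
    # Documents that lack the field are skipped rather than rejected.
    return [
        doc_id
        for doc_id, raw in documents.items()
        if json.loads(raw).get(field) == value
    ]


# Two documents in the same store with different sets of fields.
save("user:1", {"name": "Ada", "email": "ada@example.com"})
save("user:2", {"name": "Linus", "languages": ["C", "Perl"], "city": "Portland"})

in_portland = find("city", "Portland")
```

Unlike the plain key-value case, the store can look inside the value, which is what lets you search on a field that only some documents have.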
Why do traditional database users dislike these newcomers? D. Richard Hipp, the creator of SQLite, in a talk given at my local LUG, derisively called NoSQL databases “post-modern databases”, because instead of giving you a definite answer to your question, they give you “an opinion” or their “best guess”. His purpose was to paint NoSQL databases in a bad light, and for most of the old-school database world, the NoSQL, non-relational, BASE model (see the What Is ACID? sidebar) is more than a bit heretical.
The heresy comes because historically, databases almost always have tried to implement the relational model and be fully ACID-compliant. If your transactions weren't ACID, or your database wasn't relational, the argument went, you couldn't call yourself a “real” database. Look at the MySQL vs. PostgreSQL flame wars for ample evidence of this thinking.
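To show what the ACID camp is defending, here is a small sketch of atomicity using Python's built-in sqlite3 module. The accounts table, the names and the amounts are invented for the example; the point is only that a transfer that fails halfway leaves no trace.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(conn, src, dst, amount):
    # Using the connection as a context manager commits on success
    # and rolls back if any statement raises an exception.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )


try:
    # Alice has only 100, so the debit violates the CHECK constraint
    # after Bob has already been credited.
    transfer(conn, "alice", "bob", 500)
except sqlite3.IntegrityError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, both balances are back where they started: the credit to Bob was undone along with the rejected debit. That all-or-nothing guarantee is the A in ACID.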
The problem, though, is that being relational and ACID is not necessary for some use cases and can add unnecessary overhead, which you don't want if you are running a popular, heavily trafficked Web site. Many early users of MySQL knew this and were mocked for choosing MySQL over “real” databases like PostgreSQL. It is ironic that now, when MySQL has gained what every “expert” said it should have (ACID transactions), a new movement has started up claiming that all the old database technology isn't actually necessary.
What is necessary for top-tier Web sites, according to proponents of NoSQL, is massive scalability, low latency, the ability to grow the capacity of your database on demand and an easier programming model. These are among the things that, according to them, SQL RDBMSes just don't provide in a cost-effective manner.
Most classic RDBMSes initially were designed to run on a single large server. That is how it was done in the late 1970s and early 1980s, and the idea exists in the design of many RDBMSes to this day. The P in CAP (see the What Is CAP? sidebar) is meaningless when the database is on a single server (the server is either up or down, rarely or never only partly up), and traditional RDBMSes have focused mainly on Consistency, aka ACID, with Availability thrown in if you mirror between database servers or use hardware with no single points of failure.
Some NoSQL databases also focus on the C and A parts of CAP. Unlike traditional RDBMSes though, these databases are designed from the ground up to be run on dozens, hundreds or even thousands of nodes in a single data center. Partial partition tolerance for these databases is obtained by mirroring database clusters between multiple data centers. The advantage these databases have over a traditional RDBMS is that with the work spread over all of those machines, you can achieve ultra-low latency even when there are extremely high numbers of reads and writes, and with all those machines, you can analyze massive amounts of data quickly.
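One simple way to spread keys over many nodes is to hash each key and use the result to pick a node. The Python sketch below does exactly that. It is only an illustration: the node names are made up, and real systems generally use more elaborate schemes (such as consistent hashing) so that adding a node does not reshuffle most of the keys.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names


def node_for(key, nodes=NODES):
    # Use a stable hash so every client maps a key to the same node.
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


# Distribute 1,000 made-up keys and record where each one lands.
placement = {}
for user_id in range(1000):
    key = f"user:{user_id}"
    placement.setdefault(node_for(key), []).append(key)
```

Because each node holds only a fraction of the keys, each one sees only a fraction of the reads and writes, which is where the low latency under heavy load comes from.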
Other NoSQL databases focus on the A and P parts of CAP and are designed to span multiple data centers. True to CAP, strong consistency is impossible for these databases, and settling for weak consistency is an especially heretical thought to the RDBMS old guard. Instead, these NoSQL databases implement eventual consistency, whereby any changes are replicated to the entire database eventually, but at any given time, a single node or group of nodes may not have the latest data. Like the NoSQL databases that focus on C and A, the focus for A and P databases is on low latency, high throughput and anything else that makes the Web site more responsive and a richer experience for users.
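Eventual consistency is easier to see than to describe. The Python sketch below models two replicas that accept writes independently and reconcile later. It assumes a simple "highest version number wins" rule and a made-up sync function; actual databases use a variety of more sophisticated mechanisms, so treat this as a cartoon rather than a description of any one product.

```python
class Replica:
    """Toy replica holding key -> (version, value). Hypothetical design."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, version):
        # Accept the write only if it is newer than what we already have.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(replicas):
    # Push every replica's entries to every other replica.
    for source in replicas:
        for key, (version, value) in list(source.data.items()):
            for target in replicas:
                target.write(key, value, version)


east, west = Replica("east"), Replica("west")
east.write("cart:7", ["book"], version=1)
west.write("cart:7", ["book"], version=1)

# A newer write reaches only the east replica at first.
east.write("cart:7", ["book", "pen"], version=2)
stale = west.read("cart:7")   # west has not heard about version 2 yet

sync([east, west])
fresh = west.read("cart:7")   # after replication, west has caught up
```

Between the write and the sync, a reader who happens to hit the west replica gets the old cart. That window of staleness is the price paid for staying available when the link between data centers is slow or down.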
In addition to sometimes abandoning consistency in favor of scalability and latency, another way NoSQL databases break with tradition is in their abandonment of the relational model. To be fair, some data truly does not naturally fit the relational model. This could be because the data changes form or size often, or because the data is completely unstructured.
The final break with tradition in NoSQL databases is the thing that gave them their name. They don't use SQL. The reasons for dropping SQL usually revolve around it not fitting in with modern object-oriented development processes or some perceived difficulty in working with SQL. Sometimes the excuse given for not using SQL is a simple “SQL sucks”, which isn't really a reason. Stupid reasons aside, the SQL language was designed for use with relational databases, and NoSQL databases are mostly non-relational, so it makes sense that they don't use it.