Databases for Free
Anywhere data needs to be stored and organized, a database is needed. Sid shows what's available.
by Sid Wentworth
In addition to needing an operating system, some embedded systems will need a structure for organizing data. In this article I discuss various ways of storing and retrieving data. Where appropriate, I include references to free packages that may be what you need.
UNIX, on which Linux is based, has a long history of supporting relatively fast, highly reliable filesystems. While Linux filesystems are not directly based on the UNIX ones, Linux does offer reliable storage options.
The default Linux filesystem, called ext2, has been in service for about eight years. While there have been some minor enhancements during that time, the basic structure remains the same and reliability has never been questioned.
The one disadvantage of the ext2 filesystem is that after a system crash, it is necessary to run a check program to repair any possible damage. This increases system startup (boot) time and, if there is severe damage, could require manual input. This problem is not unique to Linux file systems but is something you should be aware of.
An alternative type of filesystem is called a journaling filesystem. In this filesystem, a transaction log is maintained so that the filesystem itself can perform recovery rather than requiring an external program. Such a filesystem was added to UNIX system V about ten years ago. The Reiser filesystem, a type of journaling filesystem, recently became available for Linux. It is, however, relatively new, so we don't have a good history of reliability yet.
Now, on to specific approaches for data storage and retrieval. All these methods will run on top of whichever Linux filesystem you select.
The most basic way to store data is to save it in regular files. These files are commonly called flat files because there is no indexing information available. You must read the file sequentially to search for a particular datum.
UNIX systems have included what are called DBM routines. While not precisely a database, DBM provides a handy method for storage and retrieval of data using access keys.
Just to make things a little confusing, you will be able to find libraries called dbm, ndbm, gdbm and db. Rather than offer an exhaustive list of options, I address one specific alternative: the Berkeley DB. Originally from the BSD version of UNIX, Berkeley DB is available as a commercial product from Sleepycat Software.
Before you reach for your wallet, let me explain what commercial means here. Sleepycat makes Berkeley DB available at no charge, including source code, if you use the software internally only or if you make your marketed product's source code freely available. If you find this licensing scheme unpalatable, look into ndbm. It may be a better choice for you.
Back to Berkeley DB. What it is not is a relational database package. It is, in fact, a C-callable set of functions that allow you to store and retrieve key/value pairs. These key/value pairs can be pretty much anything you want (that is, any data type and any length up to 2<+>32<+> bytes), which gives you the flexibility to do what you want. Additional features include the ability to have multiple databases; keyed and sequential access to records; memory-mapped, read-only databases; and record locking.
You can download the whole package from the Sleepycat web site, http://www.sleepycat.com/. What you will find included are three packages, the data store, concurrent data store and transactional data store.
As we move through this list, we get more features, but that means more code space and more overhead. The basic data store only offers support for one writer and multiple readers. The concurrent data store adds support for multiple writers, while the transactional data store adds locking options and the code necessary to handle hot backups and disaster recovery. Again, the right choice depends on your application.
While the native language of Berkeley DB is C, interface libraries have been built for other programming languages including Perl, Python and Tcl.
If you still need more capabilities, open-source relational databases are available. Relational databases allow you to store data in uniform structures called tables. Each table has rows and columns. Each row represents one data set or record, and each column represents a particular field within the record. For example, a table of your friends might have their name, phone number and date of birth. The table might look like the one shown in Table 1.
The power of relational databases comes when you create multiple tables and are then able to associate them using a common column in each table. This operation is called a join. For example, let's say you had a second table that contained your friends' names and their e-mail addresses (see Table 2).
By performing a join operation on these two tables and selecting the Name, Phone and E-mail columns from the result, you could produce a list containing these three columns. While this is a very basic look at what a relational database is, I expect you get the idea of where this could lead.
All that is the good news. The bad news is that in order to realize the power of a relational database you have to learn a query language. The most popular of these languages is SQL, which stands for structured query language. Originally developed by IBM, ANSI and the International Standards Organization (ISO) published SQL standards in 1986 and 1987. Subsequent standards have been a joint effort of ANSI and ISO.
Besides entering SQL statements yourself to access data, applications program interfaces (APIs) are available for most other programming languages that allow you to access SQL databases.
One important capability of a relational database is called a transaction, which allows you to identify a set of database updates that are to be treated as one event. That is, you can be assured that either all of them will be performed or none of them performed. For example, in an accounting application, you need to guarantee that a debit is performed only if the related credit is also performed.
Unlike Berkeley DB, most relational databases contain separate client and server parts. What this means is that rather than linking the database system code with your program (like Berkeley DB), you only link with the client. The client code then uses a communications protocol such as TCP/IP to communicate with the server.
The advantage of this client/server approach is that you need not run the client and the server on the same computer system. Thus, this approach scales well. The disadvantage is the additional overhead of supporting the communications protocol. Again, the right choice will depend on your application.
- Readers' Choice Awards--Nominate Your Apps & Gadgets Now!
- Memory Ordering in Modern Microprocessors, Part I
- Source Code Scanners for Better Code
- diff -u: What's New in Kernel Development
- RSS Feeds
- Non-Linux FOSS: AutoHotkey
- Security Hardening with Ansible
- Tech Tip: Really Simple HTTP Server with Python
- Using Django and MongoDB to Build a Blog