Databases for Free | Linux Journal

Software

by Sid Wentworth

on May 1, 2001

Databases for Free

Anywhere data needs to be stored and organized, a database is needed. Sid shows what's available.

by Sid Wentworth

In addition to needing an operating system, some embedded systems will need a structure for organizing data. In this article I discuss various ways of storing and retrieving data. Where appropriate, I include references to free packages that may be what you need.

UNIX, on which Linux is based, has a long history of supporting relatively fast, highly reliable filesystems. While Linux filesystems are not directly based on the UNIX ones, Linux does offer reliable storage options.

The default Linux filesystem, called ext2, has been in service for about eight years. While there have been some minor enhancements during that time, the basic structure remains the same and reliability has never been questioned.

The one disadvantage of the ext2 filesystem is that after a system crash, it is necessary to run a check program to repair any possible damage. This increases system startup (boot) time and, if there is severe damage, could require manual input. This problem is not unique to Linux file systems but is something you should be aware of.

An alternative type of filesystem is called a journaling filesystem. In this filesystem, a transaction log is maintained so that the filesystem itself can perform recovery rather than requiring an external program. Such a filesystem was added to UNIX system V about ten years ago. The Reiser filesystem, a type of journaling filesystem, recently became available for Linux. It is, however, relatively new, so we don't have a good history of reliability yet.

Now, on to specific approaches for data storage and retrieval. All these methods will run on top of whichever Linux filesystem you select.

Using Flat Files

The most basic way to store data is to save it in regular files. These files are commonly called flat files because there is no indexing information available. You must read the file sequentially to search for a particular datum.

Using DBM

UNIX systems have included what are called DBM routines. While not precisely a database, DBM provides a handy method for storage and retrieval of data using access keys.

Just to make things a little confusing, you will be able to find libraries called dbm, ndbm, gdbm and db. Rather than offer an exhaustive list of options, I address one specific alternative: the Berkeley DB. Originally from the BSD version of UNIX, Berkeley DB is available as a commercial product from Sleepycat Software.

Before you reach for your wallet, let me explain what commercial means here. Sleepycat makes Berkeley DB available at no charge, including source code, if you use the software internally only or if you make your marketed product's source code freely available. If you find this licensing scheme unpalatable, look into ndbm. It may be a better choice for you.

Back to Berkeley DB. What it is not is a relational database package. It is, in fact, a C-callable set of functions that allow you to store and retrieve key/value pairs. These key/value pairs can be pretty much anything you want (that is, any data type and any length up to 2<+>32<+> bytes), which gives you the flexibility to do what you want. Additional features include the ability to have multiple databases; keyed and sequential access to records; memory-mapped, read-only databases; and record locking.

You can download the whole package from the Sleepycat web site, http://www.sleepycat.com/. What you will find included are three packages, the data store, concurrent data store and transactional data store.

As we move through this list, we get more features, but that means more code space and more overhead. The basic data store only offers support for one writer and multiple readers. The concurrent data store adds support for multiple writers, while the transactional data store adds locking options and the code necessary to handle hot backups and disaster recovery. Again, the right choice depends on your application.

While the native language of Berkeley DB is C, interface libraries have been built for other programming languages including Perl, Python and Tcl.

Relational Database Basics

If you still need more capabilities, open-source relational databases are available. Relational databases allow you to store data in uniform structures called tables. Each table has rows and columns. Each row represents one data set or record, and each column represents a particular field within the record. For example, a table of your friends might have their name, phone number and date of birth. The table might look like the one shown in Table 1.

Table 1. Example of Relational Database Table

The power of relational databases comes when you create multiple tables and are then able to associate them using a common column in each table. This operation is called a join. For example, let's say you had a second table that contained your friends' names and their e-mail addresses (see Table 2).

Table 2. Example Table for join

By performing a join operation on these two tables and selecting the Name, Phone and E-mail columns from the result, you could produce a list containing these three columns. While this is a very basic look at what a relational database is, I expect you get the idea of where this could lead.

All that is the good news. The bad news is that in order to realize the power of a relational database you have to learn a query language. The most popular of these languages is SQL, which stands for structured query language. Originally developed by IBM, ANSI and the International Standards Organization (ISO) published SQL standards in 1986 and 1987. Subsequent standards have been a joint effort of ANSI and ISO.

Besides entering SQL statements yourself to access data, applications program interfaces (APIs) are available for most other programming languages that allow you to access SQL databases.

One important capability of a relational database is called a transaction, which allows you to identify a set of database updates that are to be treated as one event. That is, you can be assured that either all of them will be performed or none of them performed. For example, in an accounting application, you need to guarantee that a debit is performed only if the related credit is also performed.

Unlike Berkeley DB, most relational databases contain separate client and server parts. What this means is that rather than linking the database system code with your program (like Berkeley DB), you only link with the client. The client code then uses a communications protocol such as TCP/IP to communicate with the server.

The advantage of this client/server approach is that you need not run the client and the server on the same computer system. Thus, this approach scales well. The disadvantage is the additional overhead of supporting the communications protocol. Again, the right choice will depend on your application.

Specific Database Choices

As I mentioned earlier, there are many choices for relational databases. For the article, I confine my discussion to two: MySQL and PostgreSQL. A web search will reveal many others, including mSQL and InterBase. MySQL and PostgreSQL, however, are the primary contenders in this area.

MySQL is the more compact of the two choices, and its claim to fame is rapid read access. For that reason, MySQL has become very popular with web server applications where many reads and few writes are the norm.

MySQL has two shortcomings. First, because of a rather primitive locking scheme, as the number of writes increases, performance drops substantially. Second, MySQL did not support transactions until the current version, 3.23. This support, by the way, uses the capabilities of the Berkeley DB software from Sleepycat.

Interfaces are available for most programming languages. Many Linux distributions include MySQL, or you can go to http://www.mysql.com/ for the latest version. As of June 2000, MySQL has been available under the GNU Public License.

PostgreSQL is the most advanced open-source database system. It grew out of Dr. Michael Stonebraker's Postgres project at the University of California at Berkeley. PostgreSQL replaced the QUEL query language of Postgres with the more common (and standard) SQL.

Discussion of the advanced capabilities of PostgreSQL could easily be an article and, more likely, a complete book. Rather than try to document all of its features, let me list some of the highlights. If the previously discussed solutions don't offer all you need for your system and this list sparks your interest, you can find more information and the software itself at http://www.postgresql.org/.

Triggers--the ability to execute server-side functions for each modified row in a table.
Temporary tables--tables that exist only for the duration of a database session.
Foreign keys--the ability to constrain values in a column based on columns in other tables.

Like other databases, there are interfaces to PostgreSQL for most programming languages. Server-side functions (which extend PostgreSQL) can be written in SQL, PL/PSQL, PL/TCL, PL/Perl and C.

Conclusion

If your embedded application needs to store tabular data, you are talking about a database. Everything from a flat file to a relational database is on the table. When shopping for the right answer, there is more to consider than code size and reliability. In particular, think about support issues. Working with flat files is very easy, as is using the Berkeley DB routines. On the other hand, using a relational database such as MySQL or PostgreSQL introduces a lot of new code into your system, a new language and a reasonable investment of time in the role of a database administrator.

When you are developing your application, select or design an interface to the database that is flexible. In other words, make sure all the code that offers the actual interface to the database you select is small and self-contained. That way, the replacement of one database with another will not mean a major rewrite of your entire system.

Sid Wentworth prefers animals to people but also sees that computers can be his friend. He has over 20 years of experience working with computers, virtually all without involvement with Microsoft products. Sid can be reached at swentworth@hushmail.com.

email: swenthworth@hushmail.com

Load Disqus comments