Linux and Databases

by David Penn

One of the things often forgotten when new Linux users are considered is the fact that many of these people are also, for all intents and purposes, new to computers. By this, I don't mean that they have never used a computer, or don't know what a hard drive is or what RAM means. What I mean is effective knowledge about computers: what it is that really makes them work, what the different components do to and for each other, and what makes one computer different from another. If there is one drawback in the current rage over web appliances, it is in the fact that users are becoming more and more alienated from the inner workings of the device. A situation that leads to more power, perhaps, but less empowerment, and a sad combination of increased ignorance and increased dependency.

This situation isn't necessarily bad for most users. In the same way that most people don't really care what is under the hoods of their automobiles (as long as someone will spend a week fixing their car when it breaks down), most people don't want or need to know what's under the gray shell of their PC. But for those who are curious, those who do want to have some leverage of information against a cold, speechless machine, Linux represents an opportunity to learn about not just a new, different operating system, but also a chance to, in a sense, learn about computers all over again. And often from the ground up.

This week on Linux Buzz, I want to talk about databases.

What are databases?

A database is really any set of structured data or information. An address book, for example, or the American League team standings could both be considered "databases". Once we understand what a database is, fundamentally speaking, it becomes easier to scale upwards toward more and more complex databases, such as a financial general ledger or the stock charts in the Wall Street Journal. When we consider the immense volumes of information handled by computers, from the days of mainframes to the days of embedded systems, we also need to consider increasingly complex and powerful ways to enter, store and retrieve data. Computerized databases, unlike the "databases" mentioned above, provide this power to arrange and manipulate in ways far more versatile, efficient and powerful than ever before.

However, computerized databases tend to be only part of an overall system of data storage, management and retrieval. The entire system is often referred to as a Database Management System or DBMS. A DBMS is a set of frequently complex software programs that are responsible for two main tasks. First, a DBMS controls and manages the overall storage, configuration and retrieval of data in a database. Second, the Database Management System is responsible for ensuring the security and integrity of the database, by granting and restricting access based on the guidelines provided by the administrator of the database. Very often, when someone mentions a new database or talks about the open sourcing of a database (which we will soon talk about ourselves), that person is talking about the Database Management System as a whole, not just the database component.

There are three categories of database worth looking at that are, or have been, in common usage. Importantly, many contemporary DBMSes may offer one, two or a combination of all three databasing methods.

  • Hierarchical databases. These are the oldest of the modern computer databases (as opposed to the ancient computer databases, right?) and may be what most people, when asked, imagine a database to be. A hierarchical database is arranged something like an inverted family tree; access to the database begins at the top of the hierarchy and moves downward. For example, moving from "author" to "works" or "articles" would represent the access pathway in a hierarchical database. The links between these records tend to be fixed ahead of time, either as data is entered or when the database parameters are being established. This contrasts strongly with the relational databases discussed below, where links or "relationships" among data, fields and records is not determined until the data is retrieved. What hierarchical databases once gained in speed by having fixed connections, they lost in flexibility and overall efficiency. And, as computer processors became faster and programming languages more capable, the speed advantage of hierarchical databases became less and less of an advantage. Hierarchical databases were the first databases run on mainframe computers, and as mainframes have given way to more efficient computers, so have hierarchical databases been replaced largely by relational databases.

  • Relational databases. Relational databases are all the rage in terms of database efficiency and power. In fact, most of the databases mentioned toward the end of this discussion - PostgreSQL, Interbase, IBM DB2 - are relational databases. The most important thing to keep in mind about relational databases is that whereas hierarchical databases use fixed relationships between data forms, relational databases are able to compare characteristics of data, fields or records to make new files that may match the criteria of the data retriever. By way of analogy, if a hierarchical database represents "getting out only what you put in (and only in the form in which the data was entered)", then a relational database offers the ability to get, effectively, information that was not specifically "entered" into the database. While this might sound complicated (which it is, somewhat) from the point of the data retriever, a relational database can be a godsend. For example, where a hierarchical database will tell you that "Tom" placed X number of orders for Y number of Z products, a relational database would be able to tell you that "Dick" and "Harry" also bought Z products, or that those who purchased Z products tended to make X number of orders over a given period of time. It should be said that, given the range over which relational databases operate, relational databases can tend to be slow. Because of this, a variety of indexing techniques are used to target "key fields" that may be consulted frequently. Such indexing can significantly speed up the data delivery of relational databases.

  • Network databases. There are a number of definitions for network databases, but the one that is probably most relevant here is the one that explains network databases as data organization methods in which data relationships have a "net-like" or hyperlinked character. By this, I mean that one item in a network database can link to a number of data elements, and can itself be linked and be a variety of data elements. In some ways, a network database resembles a set of hierarchical databases, save for the ability of information to move in more than one direction.

What are some of the popular DBMSes?

The first RDBMS was the Multics Relational Data Store, released in 1976. This database management system is widely regarded as the first relational DBMS ever offered by a major computer manufacturer, in this case Honeywell. Since that time, quite a few relational DBMSes have come to market as both commercial and "free" software. What follows is a listing of Linux Journal magazine and on-line articles that discuss databases and database management systems. Check out the links below, and tune in Tuesday morning at 10:20 a.m. for more.

Linux Journal On-line Articles

Linux Journal Articles

Load Disqus comments