Self-Diagnostic APIs: Software Quality's Next Frontier

by Steve Graves

on April 1, 2004

With embedded software adding intelligence to so many everyday objects, it seems remarkable that the tools used to create these programs aren't smarter when it comes to catching highly destructive bugs. One culprit is the application programming interfaces (APIs) provided by software publishers. Developers have long chosen libraries of pre-built software for communication, data management, messaging and other purposes rather than creating this functionality from scratch.

But while middleware libraries offer benefits including convenience, portability and productivity, the manner in which they are constructed and used leads to bugs. This stems from the fact that software functions in APIs are nearly always data structure ignorant—they handle data without knowing its type. This severely limits the compiler's and middleware runtime's abilities to perform any validation, greatly increasing the likelihood of programming mistakes slipping through QA.

The potential for a new kind of API that helps to catch and fix such bugs is built into C and C++. API vendors can deliver a programming interface that is data-aware and self-diagnostic by taking advantage of the function argument type-checking ability of every ANSI C/C++ compiler. This article explores the idea by looking at the API of McObject's eXtremeDB, an in-memory database available on Linux—but the idea of a self-diagnostic API applies as well to other middleware categories.

The concept requires us to abandon the old idea that an API must be a static library of functions that is applied in every situation. Instead, the programming interface is generated for each project or implementation of the middleware and therefore is aware of that project's data types.

Database APIs

APIs for database software development kits (SDKs) fall into two categories: interfaces for SQL and navigational interfaces. With navigational interfaces, developers interact with the contents of a database one record at a time. SQL, in contrast, is a set-oriented programming interface. With this API, the user submits SQL statements to the database in order to select, filter, sort and join together rows (records) from many tables. The query results in a set of tuples (rows). The API then is used to fetch the results, either one row at a time or in batches. There is no industry or de facto standard interface for navigational APIs, so database vendors offering navigational interfaces provide proprietary APIs.

The Safety Issue

Yet pre-defined database APIs, whether SQL or navigational, carry a significant downside: for an interface library to be able to manage data of any database definition, it must have a programming interface that ignores the type of all data. In other words, the database programming interface must treat the data as un-typed, or opaque.

To accomplish this, databases use void pointers to pass data between the database library and the application program. A void pointer is a C/C++ language variable that legally can point to any type of data. Because a void pointer carries no type information, neither the C/C++ compiler nor the database runtime can validate the data it points to. This opens the possibility of passing a pointer to the wrong type of data, with consequences ranging from nonsense data in the database to a corrupted, unusable database to a crashed program.
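The hazard can be sketched in a few lines of C. Everything here is hypothetical and invented for illustration (the `slot` storage, `untyped_put` and `slot_as_int32` belong to no real database library); the point is only that a `void *` parameter accepts a pointer to anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical storage standing in for a database field that is
 * supposed to hold a 4-byte integer. */
static unsigned char slot[sizeof(int32_t)];

/* Hypothetical untyped "put". The compiler cannot check what `data`
 * points at, so the function can only trust the caller-supplied length.
 * Returns 0 on success, -1 if the length does not match the slot. */
int untyped_put(const void *data, size_t len)
{
    assert(data != NULL);
    if (len != sizeof slot)
        return -1;
    memcpy(slot, data, len);
    return 0;
}

/* Reads the slot back as the integer it is supposed to contain. */
int32_t slot_as_int32(void)
{
    int32_t value;
    memcpy(&value, slot, sizeof value);
    return value;
}
```

A call such as `untyped_put(&id, sizeof id)` with an `int32_t id` behaves as intended. A call such as `untyped_put("Audi", 4)` compiles just as cleanly and succeeds at run time, leaving four characters where an integer was expected.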

Let's look at three API examples, one each from Berkeley DB, SQL/ODBC and eXtremeDB. First, let's define a simple database in SQL:

create table make (
   make_id   integer,
   make_name char(20),

   primary key (make_id)
);

create table model (
   make_id    integer references make (make_id),
   model_name char(20)
);

Here's how the same database would be defined in eXtremeDB:


declare database cars;

class make
{
   unsigned<4> make_id;
   char<20>    make_name;

   hash <make_id> by_make_id[10000];
};

class model
{
   unsigned<4> make_id; // foreign key of class make
   char<20>    model_name;

   tree <make_id> by_make_id;
};

Berkeley DB does not have a data definition language. Instead, a host program stores name/value pairs, and it is up to the development team to express the organization and interrelationships of the data through source code comments and system documentation.

Next, we write code to populate these databases, first with SQL and ODBC:


void insert_make( long make_id, char *make_name )
{
   SQLCHAR *sql = (SQLCHAR *)
      "insert into make values(?,?)";
   // SQL_C_LONG binds a 32-bit SQLINTEGER, which may be
   // narrower than long, so copy the value first
   SQLINTEGER id = (SQLINTEGER) make_id;

   // housekeeping omitted

   SQLBindParameter( handle, // statement handle
      1,      // parameter number
      SQL_PARAM_INPUT, // InputOutputType
      SQL_C_LONG, // ValueType
      SQL_INTEGER, // ParameterType
      0, // ColumnSize ignored for SQL_INTEGER
      0, // DecimalDigits ignored for SQL_INTEGER
      &id, // ParameterValuePtr
      sizeof(id), // BufferLength
      NULL // StrLen_or_IndPtr: value is never NULL
   );
   SQLBindParameter( handle, // statement handle
      2,      // parameter number
      SQL_PARAM_INPUT, // InputOutputType
      SQL_C_CHAR, // ValueType
      SQL_VARCHAR, // ParameterType
      strlen(make_name), // ColumnSize 
      0, // DecimalDigits ignored for SQL_VARCHAR
      make_name, // ParameterValuePtr
      strlen(make_name), // BufferLength
      NULL // StrLen_or_IndPtr: null-terminated string
   );
   SQLExecute( handle );
}

and then with Berkeley DB:


void insert_make( long make_id, char *make_name )
{
   DBT key, data;
 
   /* clear the DBT structures before using */
   memset(&key, 0, sizeof(key));
   memset(&data, 0, sizeof(data));
 
   /* the vehicle make id is a "primary key", 
    * could be some Blue book identification number, 
    * and the make_name is the actual vehicle name
    */
   key.data = &make_id;
   key.size = sizeof(make_id);
 
   /* point at the caller's string and include
    * its terminating NUL in the stored length
    */
   data.data = make_name;
   data.size = strlen(make_name) + 1;
 
   // dbp here is a database handle
   dbp->put(dbp, NULL, &key, &data, 0); 
}

Here's code for eXtremeDB:


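void insert_make( long make_id, char *make_name )
{
   int rc;
   make make_handle; // handle type generated for class make

   // transaction_handle here is an open transaction
   rc = make_new(transaction_handle, &make_handle);
   rc = make_make_id_put(&make_handle, make_id);
   rc = make_make_name_put(&make_handle, make_name, 
                            strlen(make_name));
}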

With the code in place, let's examine the function prototypes for these examples' programming interfaces.

Prototype for SQL/ODBC:

SQLRETURN SQLBindParameter( 
   SQLHSTMT StatementHandle,
   SQLUSMALLINT ParameterNumber,
   SQLSMALLINT InputOutputType,
   SQLSMALLINT ValueType,
   SQLSMALLINT ParameterType,
   SQLUINTEGER ColumnSize,
   SQLSMALLINT DecimalDigits,
   SQLPOINTER ParameterValuePtr,
   SQLINTEGER BufferLength,
   SQLINTEGER *StrLen_or_IndPtr);

Prototype for Berkeley DB:

DB->put( DB *db, DB_TXN *txnid, DBT *key, DBT *data, 
         u_int32_t flags);

where DBT is
   typedef struct {
      void *data;
      u_int32_t size;
      u_int32_t ulen;
      // a bunch of other stuff
   } DBT;

Prototype for eXtremeDB:

MCO_RET make_new(mco_trans_h t, make *handle);
MCO_RET make_make_id_put(make *handle, 
                         uint4 make_id);
MCO_RET make_make_name_put(make *handle, 
                           char *make_name, 
                           uint2 len);

In the ODBC and Berkeley DB prototypes, the data is passed through the programming interface as void pointers. (SQLPOINTER is a typedef for void * in sqltypes.h.) This creates the potential for the programmer to code an incorrect argument. Neither the C/C++ compiler nor the database runtime can perform any validation to catch several common types of error.

The ODBC interface presents several potential mistakes. An incorrect StatementHandle could be passed, an incorrect ParameterNumber coded or an incorrect ParameterValuePtr used. A mistake in any of these three arguments causes the value bound through ParameterValuePtr to be gibberish from the database's point of view.

With Berkeley DB, the programmer has other opportunities for mistakes. The DBT structure passed as the key or data arguments can be incorrect (wrong key for the data or vice-versa), the void *data could be invalid or the size member that is supposed to tell the length of the data referenced by void *data could be incorrect.

An error in coding the function arguments results in the database runtime putting data into a location in the database it was not intended for, for example putting make data into a place the database has designated for model data. This can cause gibberish to be stored in the database or, worse, cause the database runtime to try to read beyond the end of the program's stack and produce a memory violation.

Reading data from the database when such an error is present entails its own risks. For example, attempting to read data that is N bytes wide in the database into a program variable that is less than N bytes wide causes the database to overwrite random locations in memory. This can lead to critical data being overwritten, causing a crash, or to database corruption as important runtime structures are overwritten.
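The read-side risk can be sketched the same way. The names below (`stored_field`, `untyped_get`, `FIELD_WIDTH`) are hypothetical and come from no real database API. The sketch assumes the caller reports the destination's capacity honestly, which is exactly what an untyped interface cannot verify.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stored field, 20 bytes wide like char(20). */
enum { FIELD_WIDTH = 20 };
static const char stored_field[FIELD_WIDTH] = "Volkswagen";

/* Hypothetical untyped read. The function sees only a void pointer, so
 * the destination's real size is unknown to it; it must rely on the
 * capacity the caller claims. Returns the number of bytes copied, or -1
 * if the claimed capacity is too small for the field. */
int untyped_get(void *dst, size_t dst_capacity)
{
    assert(dst != NULL);
    if (dst_capacity < FIELD_WIDTH)
        return -1;
    memcpy(dst, stored_field, FIELD_WIDTH);
    return FIELD_WIDTH;
}
```

If a caller passes an 8-byte buffer but claims 20 bytes of capacity, perhaps after pasting code written for a different field, the copy overruns the buffer and nothing in the interface can detect it.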

Such mistakes typically result not from unskilled programming but from labor-saving shortcuts, such as cutting and pasting blocks of code. Many repetitive programming tasks, such as calling middleware APIs, invite copying. For example, the fundamental steps for instantiating a new record in a database and populating its fields are the same for every type of record, differing only in the number and data types of fields. So having written such code for one record type, programmers often cut and paste—sometimes overlooking basic cleanup editing such as changing data types. The use of void pointers strips the C/C++ compiler and runtime of the ability to detect such errors.

The Safety Solution: a Self-Diagnostic API

The potential to create a better API that catches such mistakes and also reduces the API learning curve has existed since function prototypes were first introduced in the 1980s. When the C++ language emerged, it included function prototypes, the signatures of functions. A prototype declares the name of a function, its number of arguments (parameters), each argument's data type and the data type of the function's return value. If a function's use doesn't match its signature, the compiler flashes an error message, and the offending code must be corrected before the program can be successfully compiled. Function prototypes were such an improvement that they eventually were incorporated into the C language as well.
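As a minimal illustration of what a prototype buys, consider the hypothetical function `name_length` below (invented for this sketch). The calls in the trailing comment would each draw a compiler diagnostic, because they do not match the declared signature.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Prototype: the function's name, its parameter types and its return
 * type, declared before any use. */
size_t name_length(const char *name);

size_t name_length(const char *name)
{
    assert(name != NULL);
    return strlen(name);
}

/* Calls that do not match the prototype are diagnosed at compile time:
 *
 *   name_length("Saab", 4);   too many arguments
 *   name_length(3.5);         a double is not a pointer
 */
```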

Exploiting the function argument type checking ability of every ANSI C/C++ compiler can lead to a programming interface that is data-aware, and thus catches many more types of mistake during compilation. Harnessing the ANSI C compiler's function prototyping in the service of greater error-catching means abandoning the idea of an API as a static library of functions. In the case of databases, the programming interface is specific to each database design and therefore is aware of the data types of that design.

This is the approach taken with the programming interface for eXtremeDB. Although this in-memory database has a small, static API for common tasks (opening a database, establishing a connection and so on), the majority of the API—the functions concerned with populating, searching and reading the data—is generated dynamically from the database definition.

Users create the database using the eXtremeDB database definition language (DDL), which is typed into a text file and processed by a compiler, mcocomp. The compiler validates the DDL statement syntax and generates <dbname>.c and <dbname>.h files that developers include in their application projects. These files define the programming interface for that unique database, and they include function prototypes and implementations to address every type of class and index. Each interface is purpose-specific for a certain data element and operation, so the element's type is accounted for in the interface definition.
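To make the idea of a generated interface concrete, here is a toy sketch of the generation step. It is not how mcocomp works internally, which this article does not describe. The `field_desc` structure and `emit_put_prototype` function are assumptions made for illustration; they show only how a schema description can be turned into a typed prototype.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical description of one field of one schema class. */
typedef struct {
    const char *class_name;  /* e.g. "make" */
    const char *field_name;  /* e.g. "make_id" */
    const char *c_type;      /* C type used in the emitted prototype */
} field_desc;

/* Writes a typed "put" prototype for one field into buf. Returns the
 * number of characters written (excluding the terminator), or -1 if a
 * name is empty or buf is too small. */
int emit_put_prototype(const field_desc *f, char *buf, size_t buf_size)
{
    int n;

    assert(f != NULL && buf != NULL);
    if (strlen(f->class_name) == 0 || strlen(f->field_name) == 0)
        return -1;

    n = snprintf(buf, buf_size,
                 "MCO_RET %s_%s_put(%s *handle, %s value);",
                 f->class_name, f->field_name,
                 f->class_name, f->c_type);
    if (n < 0 || (size_t)n >= buf_size)
        return -1;
    return n;
}
```

Because the field's type is written into the prototype text, any later call that passes the wrong type is checked by the C compiler, not left to the database runtime.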

Let's look again at the eXtremeDB database definition for the examples used previously:


declare database cars;

class make {
   int4   make_id;
   string make_name;

   hash <make_id> by_make_id[1000];
};

class model {
   int4   make_id; // foreign key of make.make_id
   string model_name;

   tree <make_id> by_make_id;
};

For this database definition, mcocomp generates the cars.h and cars.c files that contain, among other things, the following generated interfaces for make and model (only a subset of the complete generated interface is shown here):

/*-------------------------------------------------*/
/* class make methods                              */

MCO_RET make_new        (mco_trans_h t, 
                         make *handle);
MCO_RET make_delete     (make *handle);
MCO_RET make_make_id_get(make *handle, 
                         int4 *result);
MCO_RET make_make_id_put(make *handle, int4 value);
MCO_RET make_by_make_id_find(mco_trans_h t, 
                             int4 make_id, 
                             make *handle);

/*-------------------------------------------------*/
/* class model methods                             */

MCO_RET model_new              ( mco_trans_h t, 
                                 model *handle );
MCO_RET model_delete           ( model *handle );
MCO_RET model_make_id_get      ( model *handle, 
                                 int4 * result);
MCO_RET model_make_id_put      ( model *handle, 
                                 int4 value );
MCO_RET model_by_make_id_search( mco_trans_h t, 
                                 mco_cursor_h c, 
                                 MCO_OPCODE op_, 
                                 int4 make_id );

Every argument type is matched to its corresponding data type as declared in the database definition. Consider make_new(), the equivalent of Berkeley DB's DB->put() and SQL/ODBC's INSERT statements and SQLExecute() function:

MCO_RET make_new( mco_trans_h t, make *handle );

This function requires a handle to an eXtremeDB transaction and a handle to a make object. If a programmer accidentally codes a handle to any other type of object, the ANSI C/C++ compiler issues a fatal error and the programmer must correct the mistake. In addition, the make_new interface's database context is contained within the transaction handle, making it impossible to reference the wrong database.
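The effect of distinct handle types can be sketched with stand-in definitions. The `make_h` and `model_h` structures and the two accessors below are hypothetical; the real generated handle types are not shown in this article. Although both structures have the same layout, they are different types, so a pointer to one is not accepted where a pointer to the other is declared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for two generated handle types. */
typedef struct { int32_t make_id; } make_h;
typedef struct { int32_t make_id; } model_h;

/* Typed accessors in the style of the generated prototypes. */
int make_h_id_put(make_h *handle, int32_t value)
{
    assert(handle != NULL);
    handle->make_id = value;
    return 0;
}

int make_h_id_get(const make_h *handle, int32_t *result)
{
    assert(handle != NULL && result != NULL);
    *result = handle->make_id;
    return 0;
}

/* Passing the wrong handle type is diagnosed by the compiler:
 *
 *   model_h m;
 *   make_h_id_put(&m, 7);   model_h * is not make_h *
 */
```

A C++ compiler rejects the commented-out call outright. A C compiler is required to issue a diagnostic for it; whether that diagnostic is a warning or a hard error depends on the compiler and its flags, so building with an option such as `-Werror` is one way to make it fatal.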

This approach has the additional benefit of creating a more intuitive, easier-to-learn programming interface. The eXtremeDB-generated interfaces for make and model, above, are more readable and self-documenting than are functions from a static interface designed for use with an infinite variety of database designs. The developer reading model_by_make_id_search() and model_model_name_put() knows exactly what operation is being carried out and on what data.

Intuitive, self-diagnostic programming interfaces that are specific to a given project lead to greater programmer productivity in the beginning stages of the project, which extends through the entire life cycle of the software. Although a new interface emerges for each project, simple rules govern its generation and use. Understanding a project's data model, along with the few simple rules for generating the API, means the developer is equipped to use the self-diagnostic API quickly and productively—with a lowered risk of introducing destructive bugs.

Steve Graves is cofounder and CEO of McObject.
