Self-Diagnostic APIs: Software Quality's Next Frontier
With embedded software adding intelligence to so many everyday
objects, it seems remarkable that the tools used to create
these programs aren't smarter when it comes to catching highly
destructive bugs.
In assigning blame for such errors, one culprit lies in the
application programming interfaces (APIs) provided by software
publishers. Developers have long chosen libraries of pre-built
software for communication, data management, messaging and
other purposes rather than creating this functionality from scratch.
But while middleware libraries offer benefits including
convenience, portability and productivity, the manner in which
they are constructed and used leads to bugs. This stems from
the fact that software functions in APIs are nearly always
data structure ignorant—they handle data without knowing
its type. This severely limits the compiler's and middleware
runtime's abilities to perform any validation, greatly increasing
the likelihood of programming mistakes slipping through QA.
The potential for a new kind of API that helps to
catch and fix such bugs is built into C and C++. API vendors
can deliver a programming interface that is data-aware and
self-diagnostic by taking advantage of the function
argument type-checking ability of every ANSI C/C++ compiler.
This article explores the idea by looking at the API of
McObject's eXtremeDB, an in-memory database available on Linux—but the idea of a self-diagnostic API applies as well to
other middleware categories.
The concept requires us to abandon the old idea that an API
must be a static library of functions that is applied in every
situation. Instead, the programming interface is generated
for each project or implementation of the middleware and
therefore is aware of that project's data types.
Database APIs
APIs for database software development kits (SDKs) fall into
two categories: interfaces for SQL and navigational
interfaces.
With navigational interfaces, developers interact with the
contents of a database one record at a time. SQL, in contrast,
is a set-oriented programming interface. With this API, the
user submits SQL statements to the database in order to select,
filter, sort and join together rows (records) from many tables.
The query results in a set of tuples (rows). The API then is
used to fetch the results, either one row at a
time or in batches. There is no industry or de-facto standard
interface for navigational APIs, so database vendors offering
navigational interfaces provide proprietary APIs.
The Safety Issue
Yet pre-defined database APIs, whether SQL or navigational, carry a significant downside: for an interface library to be
able to manage data of any database definition, it must have
a programming interface that ignores the type of all data.
In other words, the database programming interface must treat
the data as un-typed, or opaque.
To accomplish this, databases use void pointers to pass data
between the database library and the application program.
A void pointer is a C/C++ language variable that
legally can point to any type of data. With no type, neither
the C/C++ compiler nor the database runtime can perform any
validation on them. This opens the possibility of passing a
pointer to the wrong type of data, with consequences ranging
from nonsense data in the database to a corrupted, unusable
database to a crashed program.
Let's look at three API examples, one each from Berkeley DB, SQL/ODBC
and eXtremeDB. First, let's define a simple database in SQL:
create table make (
    make_id integer primary key,
    make_name char(20)
);
create table model (
    make_id integer references make,
    model_name char(20)
);
Here's how the same database would be defined in eXtremeDB:
declare database cars;
class make
{
unsigned<4> make_id;
char<20> make_name;
hash <make_id> by_make_id[10000];
};
class model
{
unsigned<4> make_id; // foreign key of class make
char<20> model_name;
tree <make_id> by_make_id;
};
Berkeley DB does not have a data definition language. Instead, a host program
stores name/value pairs, and it is up to the development team to express the
organization and interrelationships of the data through source code comments
and system documentation.
Next, we write code to populate these databases, first with SQL and
ODBC:
void insert_make( long make_id, char *make_name )
{
    SQLCHAR *sql = (SQLCHAR *)
        "insert into make values(?,?)";
    SQLINTEGER id = (SQLINTEGER) make_id; // SQL_C_LONG binds a 32-bit SQLINTEGER
    // housekeeping omitted
    SQLBindParameter( handle, // statement handle
        1, // parameter number
        SQL_PARAM_INPUT, // InputOutputType
        SQL_C_LONG, // ValueType
        SQL_INTEGER, // ParameterType
        0, // ColumnSize ignored for SQL_INTEGER
        0, // DecimalDigits ignored for SQL_INTEGER
        &id, // ParameterValuePtr
        sizeof(id), // BufferLength ignored for fixed-length types
        NULL // StrLen_or_IndPtr: value is not NULL
    );
    SQLBindParameter( handle, // statement handle
        2, // parameter number
        SQL_PARAM_INPUT, // InputOutputType
        SQL_C_CHAR, // ValueType
        SQL_VARCHAR, // ParameterType
        strlen(make_name), // ColumnSize
        0, // DecimalDigits ignored for SQL_VARCHAR
        make_name, // ParameterValuePtr
        strlen(make_name), // BufferLength
        NULL // StrLen_or_IndPtr: string is null-terminated
    );
    SQLExecute( handle );
}
Here's code for Berkeley DB:
void insert_make( long make_id, char *make_name )
{
    DBT key, data;
    /* clear the DBT structures before using */
    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    /* the vehicle make id is a "primary key",
     * could be some Blue book identification number,
     * and the make_name is the actual vehicle name
     */
    key.data = &make_id;
    key.size = sizeof(make_id);
    data.data = make_name;
    data.size = strlen(make_name) + 1; /* include the terminating null */
    // dbp here is a database handle
    dbp->put(dbp, NULL, &key, &data, 0);
}
Here's code for eXtremeDB:
void insert_make( long make_id, char *make_name )
{
    int rc;
    rc = make_new(transaction_handle, &make_handle);
    rc = make_make_id_put(&make_handle, make_id);
    rc = make_make_name_put(&make_handle, make_name,
        strlen(make_name));
}
With the code in place, let's examine the function prototypes for these
examples' programming interfaces.
Prototype for SQL/ODBC:
SQLRETURN SQLBindParameter(
    SQLHSTMT StatementHandle,
    SQLUSMALLINT ParameterNumber,
    SQLSMALLINT InputOutputType,
    SQLSMALLINT ValueType,
    SQLSMALLINT ParameterType,
    SQLUINTEGER ColumnSize,
    SQLSMALLINT DecimalDigits,
    SQLPOINTER ParameterValuePtr,
    SQLINTEGER BufferLength,
    SQLINTEGER *StrLen_or_IndPtr);
Prototype for Berkeley DB:
int DB->put( DB *db, DB_TXN *txnid, DBT *key, DBT *data,
    u_int32_t flags);
where DBT is
typedef struct {
    void *data;
    u_int32_t size;
    u_int32_t ulen;
    // a bunch of other stuff
} DBT;
Prototype for eXtremeDB:
MCO_RET make_new(mco_trans_h t, make *handle);
MCO_RET make_make_id_put(make *handle,
uint4 make_id);
MCO_RET make_make_name_put(make *handle,
char *make_name,
uint2 len);
In the ODBC and Berkeley DB prototypes, the data is passed
through the programming interface as void pointers. (SQLPOINTER
is a typedef for void * in sqltypes.h.) This creates the
potential for the programmer to code an incorrect argument.
Neither the C/C++ compiler nor the database runtime can
perform any validation to catch several common types of error.
The ODBC interface presents several potential mistakes. An incorrect
StatementHandle could be passed, an incorrect ParameterNumber coded
or an incorrect ParameterValuePtr used. A mistake in any of these three
arguments means the data reached through ParameterValuePtr is treated as
something it is not, and gibberish goes into the database.
With Berkeley DB, the programmer has other opportunities
for mistakes. The DBT structure passed as the key or
data arguments can be incorrect (wrong key for the data
or vice versa), the void *data could be invalid, or the
size and ulen members that are supposed to tell the length of
the data and of the buffer referenced by void *data could be
incorrect.
An error in coding the function arguments results in the
database runtime putting data into a location in the database
it was not intended for, for example putting make data into
a place the database has designated for model data. This
can cause gibberish to be stored in the database or, worse,
cause the database runtime to try to read beyond the end of
the program's stack and produce a memory violation.
Reading data from the database when such an error is present
entails its own risks. For example, attempting to read data
that is N bytes wide in the database into a program variable
that is less than N bytes wide causes the database
to overwrite random locations in memory. This can lead to
critical data being overwritten, causing a crash, or to database
corruption as important runtime structures are overwritten.
Such mistakes typically result not from unskilled programming
but from labor-saving shortcuts, such as cutting and pasting
blocks of code. Many repetitive programming tasks, such
as calling middleware APIs, invite copying. For example,
the fundamental steps for instantiating a new record in a
database and populating its fields are the same for every
type of record, differing only in the number and data types
of fields. So having written such code for one record type,
programmers often cut and paste—sometimes overlooking
basic cleanup editing such as changing data types. The use
of void pointers strips the C/C++ compiler and runtime of
the ability to detect such errors.
The Safety Solution: a Self-Diagnostic API
The potential to create a better API that catches such mistakes
and also reduces the API learning curve has existed since
function prototypes were first introduced in the 1980s. When
the C++ language emerged, it included function prototypes,
the signatures of functions. A prototype declares the
name of a function, its number of arguments (parameters),
each argument's data type and the data type of the function's
return value. If a function's use doesn't match its signature,
the compiler flashes an error message, and the offending code
must be corrected before the program can be successfully
compiled. Function prototypes were such an improvement that
they eventually were incorporated into the C language as well.
Exploiting the function argument type checking ability of
every ANSI C/C++ compiler can lead to a programming interface
that is data-aware, and thus catches many more types of
mistakes during compilation. Harnessing the ANSI C compiler's
function prototyping in the service of greater error-catching
means abandoning the idea of an API as a static library of
functions. In the case of databases, the programming interface
is specific to each database design and therefore is aware of
the data types of that design.
This is the approach taken with the programming interface
for eXtremeDB. Although this in-memory database has a small,
static API for common tasks (opening a database, establishing
a connection and so on), the majority of the API—the functions
concerned with populating, searching and reading the data—is generated dynamically from the database definition.
Users create the database using the eXtremeDB database
definition language (DDL), which is typed into a text file and
processed by a compiler, mcocomp. The compiler validates the
DDL statement syntax and generates <dbname>.c and <dbname>.h
files that developers include in their application projects.
These files define the programming interface for that unique
database, and they include function prototypes and implementations
to address every type of class and index. Each interface is
purpose-specific for a certain data element and operation,
so the element's type is accounted for in the interface
definition.
Let's look again at the eXtremeDB database definition for the examples used
previously:
declare database cars;
class make {
int4 make_id;
string make_name;
hash <make_id> by_make_id[1000];
};
class model {
int4 make_id; // foreign key of make.make_id
string model_name;
tree <make_id> by_make_id;
};
For this database definition, mcocomp generates the cars.h and cars.c files
that contain, among other things, the following generated interfaces for
make and model (only a subset of the complete generated interface is shown
here):
/*-------------------------------------------------*/
/* class make methods */
MCO_RET make_new (mco_trans_h t,
make *handle);
MCO_RET make_delete (make *handle);
MCO_RET make_make_id_get(make *handle,
int4 *result);
MCO_RET make_make_id_put(make *handle, int4 value);
MCO_RET make_by_make_id_find(mco_trans_h t,
int4 make_id,
make *handle);
/*-------------------------------------------------*/
/* class model methods */
MCO_RET model_new ( mco_trans_h t,
model *handle );
MCO_RET model_delete ( model *handle );
MCO_RET model_make_id_get ( model *handle,
int4 * result);
MCO_RET model_make_id_put ( model *handle,
int4 value );
MCO_RET model_by_make_id_search( mco_trans_h t,
mco_cursor_h c,
MCO_OPCODE op_,
int4 make_id );
Every argument type is matched to its corresponding data type as
declared in the database definition. Consider make_new(), the counterpart of
Berkeley DB's DB->put() method and of SQL/ODBC's INSERT statement and
SQLExecute() function:
MCO_RET make_new( mco_trans_h t, make *handle );
This function requires a handle to an eXtremeDB transaction and a handle to a
make object. If a programmer accidentally codes a handle to any other type of
object, the ANSI C/C++ compiler issues a fatal error and the programmer must
correct the mistake. In addition, the make_new interface's database context
is contained within the transaction handle, making it impossible to reference
the wrong database.
This approach has the additional benefit of creating a more intuitive,
easier-to-learn programming interface. The eXtremeDB-generated
interfaces for make and model, above, are more readable and self-documenting
than are functions from a static interface designed for use with an infinite
variety of database designs. The developer reading model_by_make_id_search()
and model_model_name_put() knows exactly what operation is being carried out and
on what data.
Intuitive, self-diagnostic programming interfaces that are specific to a given
project raise programmer productivity from the earliest stages of the project,
a gain that extends through the entire life cycle of the software.
Although a new interface emerges for each project, simple rules govern its
generation and use. Understanding a project's data model, along with the few
simple rules for generating the API, means the developer is equipped to use the
self-diagnostic API quickly and productively—with a lowered risk of
introducing destructive bugs.
Steve Graves is cofounder and CEO of McObject.
Comments
If you must wear the static typing straightjacket...
There is a deeper issue here of static versus dynamic typing. Languages like C++, Java etc. try to carefully set out the types of everything in advance, while Python, Perl, etc. do not require any pre-declaration.
The problem is that SQL is stuck in between. Given an arbitrary SELECT statement, it is not trivial to mechanically determine the type of the returned fields without knowing the current schema AND all the functions built into the particular database product we are talking to. The programmer knows, but the compiler won't. And if you restrict the kind of SELECTs you allow, you throw out half the power of an RDBMS.
Conversely, you cannot throw any value into any field, because SQL tables do have defined types.
Finally, the schema definition language (DDL) is distinct from the code in which you are accessing the database, so it is very difficult to bring them together to check any of this.
The problem with the article is that it assumes we want to extend the static typing straightjacket out into the DBMS. If you must use C++, this makes a certain amount of sense. I can certainly see advantages in this, but it also makes a lot of work.
Just remember, folks, that the impedance mismatch is much less when using SQL from a language that can go the other way and take the dynamic types as they come, particularly for rapid development where performance is not critical - and many databases will have upgrade or maintenance procedures that are in this category. The pain and error-prone ODBC binding the author describes vanishes when using Python with ODBC.
SQL is a raw itch for statically typed languages, and I'm not sure that eXtremeDB magics the mismatch away as much as the author would like you to think.
F-G
Re: Self-Diagnostic APIs: Software Quality's Next Frontier
If you want to write an ad for your proprietary product, I think it is best to just write an ad, not pretend you are inventing new programming techniques. The tone taken in this article is simply insulting.
Not so new...
I think the following could be considered "prior-art", if you will...
MFC's database bindings create type-correct source code interface to tables, based on the tables' interface.
CORBA's IDL will create application-specific client-side classes that are structured according to the messaging requirements. I think these are type-correct.
ObjectStore is an object-oriented database. You create a C++ class, and ObjectStore will figure out how to store that thingy in a database. Again, the compiler ensures typesafety.
So while eXtremeDB's technique might be nice, it's probably not an entirely new approach.
Re: Not so new...
If you use EJB you have strongly typed functions to perform changes in the underlying DB. The only difference is that the MFC wrapper is generated based on an existing DB schema, and EJB uses type information to generate the schema for the DB. So nothing new in this technique at all. The whole article sounds like a product advertisement, not a technical article.