autoSql and autoXml: Code Generators from the Genome Project

These tools have saved us from the drudgery of writing tens of thousands of lines of repetitive code, and we hope you find them useful.

Types of Objects

autoSql has three types of objects:

  • Simple: objects that contain no variable-sized arrays.

  • Object: objects that can contain variable-sized arrays. A next pointer is automatically inserted as the first field in the C structure corresponding to an object.

  • Table: like objects, but the program generates an SQL definition as well as a C definition.

Simple objects differ from other objects in how the program treats array declarations. In the field declaration:

simple point[3] triangle;  "A three sided figure"

the three points are stored in memory as a C array. If this were declared instead as

object point[3] triangle;  "A three sided figure"

the three points would be stored in memory as a singly linked list.
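
To make the difference concrete, here is a minimal C sketch of the two layouts. The structure names, the x and y fields and the two helper functions are assumptions made for illustration; they are not the code autoSql actually emits.

```c
#include <assert.h>
#include <stddef.h>

/* Element of a "simple" array: no next pointer (x and y are assumed fields). */
struct pointSimple
    {
    int x, y;
    };

/* Element of an "object" array: next is the first field, as described above. */
struct pointObject
    {
    struct pointObject *next;
    int x, y;
    };

/* simple point[3] triangle;  three points stored inline as a C array. */
struct triangleSimple
    {
    struct pointSimple triangle[3];
    };

/* object point[3] triangle;  head of a singly linked list of points. */
struct triangleObject
    {
    struct pointObject *triangle;
    };

/* Push pt onto the front of list and return the new head. */
struct pointObject *pointListPush(struct pointObject *list, struct pointObject *pt)
{
assert(pt != NULL);
pt->next = list;
return pt;
}

/* Count the points in an object-style list by following next pointers. */
int pointListLength(const struct pointObject *list)
{
int count = 0;
for (; list != NULL; list = list->next)
    ++count;
return count;
}
```

With the simple form the three points sit side by side inside the triangle itself. With the object form the triangle holds only a pointer, and each point is reached by walking the next links.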

Types of Fields

The following basic field types are supported:

  • int: 32-bit signed integer

  • uint: 32-bit unsigned integer

  • short: 16-bit signed integer

  • ushort: 16-bit unsigned integer

  • byte: 8-bit signed integer

  • ubyte: 8-bit unsigned integer

  • float: single precision IEEE floating point

  • char: 8-bit character (can only be used in an array)

  • string: variable length string up to 255 bytes long

  • lstring: variable length string up to 2 billion bytes long

Additionally, the simple, object and table types can be used as fields.
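
As a rough guide to what these widths mean on the C side, the sketch below pins each basic type to a fixed-width C type. The typedef names, the asStringMax constant and the fitsInString helper are hypothetical; the generated code may well use plain int, unsigned, short and so on instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical typedefs mirroring the documented field widths. */
typedef int32_t  asInt;      /* int:    32-bit signed integer    */
typedef uint32_t asUint;     /* uint:   32-bit unsigned integer  */
typedef int16_t  asShort;    /* short:  16-bit signed integer    */
typedef uint16_t asUshort;   /* ushort: 16-bit unsigned integer  */
typedef int8_t   asByte;     /* byte:   8-bit signed integer     */
typedef uint8_t  asUbyte;    /* ubyte:  8-bit unsigned integer   */
typedef float    asFloat;    /* float:  single precision         */
typedef char    *asString;   /* string: up to 255 bytes          */
typedef char    *asLstring;  /* lstring: much longer strings     */

/* Documented length limit for the string type (assumed to exclude the NUL). */
enum { asStringMax = 255 };

/* Return true if text is short enough to be stored in a string field. */
bool fitsInString(const char *text)
{
assert(text != NULL);
return strlen(text) <= asStringMax;
}
```

A check like fitsInString is one way to decide between string and lstring before loading data: anything that can exceed 255 bytes belongs in an lstring field.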

Fixed Length and Variable Array Declarations

An array can be declared as either fixed size or variable size. A fixed size array is declared with a number inside the brackets, as in point[3] above. A variable sized array is declared by putting a field name inside the brackets instead; that field holds the element count and must be defined before the array.
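
The sketch below shows one plausible C shape for a variable sized array: a count field followed by a pointer to heap storage. The polyline structure, its field names and the two helpers are made up for this example and are not autoSql output.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed autoSql fields inside a hypothetical "object polyline":
 *     int pointCount;            "Number of points"
 *     int[pointCount] xCoords;   "One x coordinate per point"
 * The count field comes first, so it is known before the array is read. */
struct polyline
    {
    struct polyline *next;   /* Objects carry a next pointer as the first field. */
    int pointCount;          /* Number of entries in xCoords. */
    int *xCoords;            /* Heap array holding pointCount values. */
    };

/* Allocate a polyline with room for pointCount zeroed coordinates.
 * Returns NULL if memory runs out. */
struct polyline *polylineNew(int pointCount)
{
assert(pointCount >= 0);
struct polyline *poly = calloc(1, sizeof(*poly));
if (poly == NULL)
    return NULL;
poly->pointCount = pointCount;
if (pointCount > 0)
    {
    poly->xCoords = calloc((size_t)pointCount, sizeof(poly->xCoords[0]));
    if (poly->xCoords == NULL)
        {
        free(poly);
        return NULL;
        }
    }
return poly;
}

/* Free a polyline and its array, then set the caller's pointer to NULL. */
void polylineFree(struct polyline **pPoly)
{
assert(pPoly != NULL);
if (*pPoly != NULL)
    {
    free((*pPoly)->xCoords);
    free(*pPoly);
    *pPoly = NULL;
    }
}
```

Because the count is stored in the structure, code that loads or saves the array never has to guess its length.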

A More Complicated Example

Imagine that you've just built an amazing 3-D modeling program. The only problem is that now you need to save the structures in a database. Listing 1 is a way you might build the database with autoSql. Saving it as and running

autoSql threeD

would end up generating 393 lines of bug-free (I think!) C code and 14 lines of SQL for the investment of 33 lines of specification. (Refer to Listing 2 for the complete autoSql grammar.)

Listing 1. Building the Database with autoSql

Listing 2. autoSql Grammar

autoXml Overview

autoXml generates C code for an XML parser given an XML DTD file. It will generate a structure for each “element” in the DTD and populate the structure with fields for each attribute of the element. By default, it will generate a parser that ignores elements and attributes not in the DTD, but otherwise is a validating parser. If you use the -picky flag, it will be fully validating.

The autoXml parser will load the entire file into memory. If this is a problem you'll have to resort to the lower-level xap parser, which is much like the commonly used expat parser, but a bit faster.

A Short XML and DTD Tutorial

If you find yourself befuddled by all the acronyms so far, you're probably new to XML (eXtensible Markup Language). It has a tag-based format, and a simple example of an XML doc might be:

   <POLYGON id="square">
      <DESCRIPTION>This is soooo square man</DESCRIPTION>
      <POINT x="0" y="0" />
      <POINT x="0" y="1" />
      <POINT x="1" y="1" />
      <POINT x="1" y="0" />
   </POLYGON>

Everything in XML lives between <TAG></TAG> pairs. A tag may have associated text, attributes and subtags. In the example above, POLYGON has the subtags DESCRIPTION and POINT, the attribute id and no text. DESCRIPTION has the text “This is soooo square man” and no subtags or attributes. POINT has the attributes x and y. POINT also illustrates a little XML shortcut: tags containing only attributes can be written <TAG att="something" /> as a shortcut for <TAG att="something"></TAG>.

XML is much like HTML but has significant differences. All attributes must be enclosed in quotes in XML, while quotes are optional in HTML. Tags must strictly nest in XML, while HTML allows tags to be opened but not closed. The tags in HTML are predefined. In XML the definition of tags is up to you.

Tags can be defined two ways in XML: by a DTD file or by an XML schema, and each method has pros and cons. DTD files are relatively simple and are recognized by a wide variety of parsers and XML browsers, but they can't express that a certain attribute has to be numerical. XML schemas are more complex and are not as widely supported yet, though they are themselves written in a type of XML, which is nice in some ways. Currently autoXml works only with DTD files, plus some modest extensions.

Here is a DTD file that would describe the POLYGON format above:
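
One DTD consistent with the example, offered as a sketch rather than as the original listing, might read as follows. The occurrence markers on DESCRIPTION and POINT, the #REQUIRED defaults and the use of the standard EMPTY keyword are all assumptions.

```dtd
<!-- Sketch in standard DTD syntax; markers and defaults are assumed. -->
<!ELEMENT POLYGON (DESCRIPTION?, POINT+)>
<!ATTLIST POLYGON id CDATA #REQUIRED>

<!ELEMENT DESCRIPTION (#PCDATA)>

<!ELEMENT POINT EMPTY>
<!ATTLIST POINT x CDATA #REQUIRED>
<!ATTLIST POINT y CDATA #REQUIRED>
```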


The DTD has two major types of definitions: ELEMENTs and ATTLISTs (or attributes). An element definition includes the name of the element and an optional parenthesized list of sub-elements. The sub-elements must be defined elsewhere in the DTD with the exception of the #PCDATA sub-element, which is used to indicate that the element can have text between its tags. Each sub-element may be followed by one of the following characters:

  • ?: the sub-element is optional.

  • +: the sub-element occurs at least once.

  • *: the sub-element occurs 0 or more times.

If there is no following character the sub-element occurs exactly once.
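
These occurrence rules are easy to state in code. The function below is a small sketch of the check a validating parser might perform once it has counted how often a sub-element appeared; the name occurrenceOk and the use of '\0' to mean "no following character" are assumptions, not part of autoXml.

```c
#include <assert.h>
#include <stdbool.h>

/* Return true if a sub-element seen count times satisfies its marker.
 * marker is '?', '+', '*', or '\0' when no character follows the name. */
bool occurrenceOk(char marker, int count)
{
assert(count >= 0);
switch (marker)
    {
    case '?':
        return count <= 1;    /* optional: zero or one */
    case '+':
        return count >= 1;    /* at least once */
    case '*':
        return true;          /* zero or more */
    case '\0':
        return count == 1;    /* exactly once */
    default:
        return false;         /* unknown marker */
    }
}
```

For instance, a polygon with four POINT sub-elements passes a '+' rule, while a polygon with no points fails it.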

The ATTLIST defines an attribute and associates it with an element. It is good style to keep ATTLISTs together with their ELEMENT. Here are the fields in an ATTLIST:

  • element: name of element this is associated with.

  • name: name of this attribute.

  • type: generally CDATA. Can be a reference or date, but these are not supported by autoXml.

  • default: this contains a default value to be used if the attribute is not present. The keyword #REQUIRED in this field means that the attribute must be present. The keyword #IMPLIED means that it's okay for this attribute to be missing (in which case it will have a NULL or zero value after it is read by autoXml).
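
The handling of the default field can be sketched in a few lines of C. Everything named here (attMode, attDef, attResolve) is hypothetical and only illustrates the rules just listed; it is not the code autoXml generates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The three possibilities for the default field of an ATTLIST. */
enum attMode {attRequired, attImplied, attHasDefault};

/* One ATTLIST entry; the type field is omitted since it is generally CDATA. */
struct attDef
    {
    const char *element;      /* Name of the element this belongs to. */
    const char *name;         /* Name of this attribute. */
    enum attMode mode;        /* #REQUIRED, #IMPLIED or a literal default. */
    const char *defaultVal;   /* Used only when mode is attHasDefault. */
    };

/* Return the value to store for an attribute. given is the text found in
 * the document, or NULL if the attribute was absent. *isValid is set to
 * false only when a #REQUIRED attribute is missing. */
const char *attResolve(const struct attDef *def, const char *given, bool *isValid)
{
assert(def != NULL && isValid != NULL);
*isValid = true;
if (given != NULL)
    return given;
switch (def->mode)
    {
    case attRequired:
        *isValid = false;         /* Must be present. */
        return NULL;
    case attImplied:
        return NULL;              /* Okay to be missing; stays NULL. */
    case attHasDefault:
        return def->defaultVal;   /* Fall back to the declared default. */
    }
return NULL;
}
```

A value present in the document always wins. The three modes differ only in what happens when the attribute is absent.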