autoSql and autoXml: Code Generators from the Genome Project
Moving data from one source to another is not all that difficult in the scheme of things. If your source data is a tab-delimited file, for example, and you need it in an SQL relational database, you might write a little SQL definition, then churn out a C program to read the data from the source file and write it out to the database. But when you're dealing with a big project, or in our case, really big, and you find yourself working with dozens of sources giving you gigabytes of data, writing all that code gets old fast.
To solve this problem, here are two tools to do the job. Together, they generate database definitions for SQL, write C header files with your data definitions and function prototypes, write C code to get data to and from C structures and generate C code for an XML parser.
The human genome is the instruction manual that is encoded in our DNA. It is made up of three billion pairs of chemical letters, commonly known by the initials G, C, A and T. The genome data is 24 long strands of these letters—not exactly light reading. The Human Genome Browser is a web site at the University of California, Santa Cruz that gives scientists around the world a visual representation of this mountain of data. The browser combines the sequence data itself with higher-level annotations of the function of particular regions of the genome. Users can locate and zoom in on genes they are interested in, link to research conducted on that section of the genome and compare the genomic data with that of other species. The browser stacks particular types of annotations as tracks beneath genome coordinate positions.
The Genome Browser has an HTML/CGI front end that lets the user view and (with the help of dynamically generated image maps) click on genome tracks. Form fields give the user a way to set zoom level and control the data density of the tracks. The CGI source code is C, and the genome data is stored in an SQL database.
There is a lot of data. The browser source code has to deal with data formats for gene prediction and for similarities between the human genome and the genomes of other species. Complicating matters is the fact that we collaborate with at least a dozen external sources that each have data in their own format. Even if we don't want to use their data formats internally, we still need to write a parser to read them in and convert them to our own format. Probably half of our use of the code generators is to make it easier to import files from other groups.
autoSql generates SQL and C code for saving and loading data to a database. By using autoSql, we don't need to write the tedious data definition, which involves reading and writing code. For example, the browser has around 30 public tracks and 30 experimental tracks. Each track is associated with a table in a relational database. All of the modules that load a track table into memory are generated by autoSql.
Later in the project, we started using XML to collaborate with a research group in Japan. XML is also useful to exchange data with other public sites via DAS (the Distributed Annotation System, a protocol for transferring genomic data over the Internet).
autoXml generates C code for an XML parser given an XML DTD file. Since XML I/O is even more code intensive than SQL I/O, autoXml has already proven to be useful.
Together, autoSql and autoXml have proved to be invaluable time-savers. autoSql has been a critical workhorse to the browser project. At 1,200 lines, it has generated fully half the browser program, tens of thousands of lines of code.
Although we don't use XML as much as SQL, we've already broken even with autoXml. In a single project to import data from the Riken mouse genome annotation project in Japan, autoXml generated approximately 1,500 lines of code. (It's only 1,200 lines itself.)
You can download the binaries for autoSql and autoXml from www.soe.ucsc.edu/~kent/exe. The source code, Linux executables and examples from this article are at www.soe.ucsc.edu/~kent/src/autoCode.tgz.
autoSql is a program that automatically generates an SQL table creation script and C code for saving and loading data to a database based on an object specification. (See Figure 1 for an overview of this process.)
The specification language is a bit quirky, but it has proven effective for many jobs. We originally developed autoSql for use with a relational database; it turns out that it generates code that can load from many flat formats as well, as long as they are in a text format.
Fast/Flexible Linux OS Recovery
On Demand Now
In this live one-hour webinar, learn how to enhance your existing backup strategies for complete disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible full-system recovery solution for UNIX and Linux systems.
Join Linux Journal's Shawn Powers and David Huffman, President/CEO, Storix, Inc.
Free to Linux Journal readers.Register Now!
- Google's Abacus Project: It's All about Trust
- Download "Linux Management with Red Hat Satellite: Measuring Business Impact and ROI"
- Seeing Red and Getting Sleep
- Secure Desktops with Qubes: Introduction
- Back to Backups
- Fancy Tricks for Changing Numeric Base
- Working with Command Arguments
- Secure Desktops with Qubes: Installation
- Linux Mint 18
- CentOS 6.8 Released
Until recently, IBM’s Power Platform was looked upon as being the system that hosted IBM’s flavor of UNIX and proprietary operating system called IBM i. These servers often are found in medium-size businesses running ERP, CRM and financials for on-premise customers. By enabling the Power platform to run the Linux OS, IBM now has positioned Power to be the platform of choice for those already running Linux that are facing scalability issues, especially customers looking at analytics, big data or cloud computing.
￼Running Linux on IBM’s Power hardware offers some obvious benefits, including improved processing speed and memory bandwidth, inherent security, and simpler deployment and management. But if you look beyond the impressive architecture, you’ll also find an open ecosystem that has given rise to a strong, innovative community, as well as an inventory of system and network management applications that really help leverage the benefits offered by running Linux on Power.Get the Guide