Native XML Data Storage and Retrieval

A new generation of databases creates a new set of decisions and several full-featured ways to build queries.

The design and implementation trade-offs within a native XML database make a significant impact on the performance, scalability and features available to applications that use it. This article focuses on the granularity of stored XML documents and indexing as two of the most critical design considerations. Berkeley DB XML from Sleepycat Software (www.sleepycat.com/products/xml.shtml) is the basis for this discussion.

The basic functions of an XML database are to store documents, query over documents and handle query results. Of course, indexes are required to obtain acceptable query performance.

In a relational database, pieces of a relational table are stored, queries are SQL and results are tabular. This abstraction and standardization is useful from an application developer's perspective. Developers have less visibility into precisely how documents are stored and indexed and how a query can leverage the combination of storage format, indexes and query language to answer a question quickly.

The same concepts exist in a native XML database, such as Berkeley DB XML. In this case, the data is the XML document and the query may be an XPath or XQuery expression. The results may be XML documents, DOM, SAX or a proprietary form. Within a native XML database, mechanisms for storage, indexing and querying are not obvious from the perspective of an application developer, yet they are critical to the function, performance and scalability of the overall system.

A native XML database exposes a logical model of storing and retrieving XML documents; however, its internal storage model may not be equivalent to the document. Indexing is a crucial component of any database. Without intelligent indexing, a database is little better than a filesystem for information retrieval. Query processing builds on both storage format and indexes but is beyond the scope of this article.

Storage Formats and Granularity

Most native XML databases are oriented toward storing XML documents, where a key issue is the granularity with which the document is stored. In database terms, granularity can be described in several different ways: external access, internal addressability and concurrency.

A distinction is made between access granularity and addressability. Addressability refers to objects that can be named and accessed directly, without navigation, within the system. Access may be provided through a DOM to a system with an addressable granularity of an XML document, by parsing the document. In this sense, access granularity is user-visible, while addressability is an internal concept. Concurrency means how objects can be modified concurrently, if such a feature is supported.

Intact Document Storage

There are two major choices in terms of how to store a document—intact or not intact. Systems that store XML documents intact usually parse the XML in order to ensure it is well formed and valid but otherwise store documents unchanged. This is useful for applications that require retrieval of the entire byte-for-byte document or for round tripping. Furthermore, for relatively small documents that tend to be retrieved and processed whole, such a system is ideal. The major issue for intact document storage is how to address target documents within a collection of documents. There are two primary mechanisms to do this: a unique identifier, such as name or document ID, or a query expression, such as XQuery. The first results in exactly one document, whereas the latter may return many documents in a result set.

For a large collection, it must be possible to target a small set of result documents in a query. For intact document storage, this implies an indexing mechanism. If a document is parsed upon insertion into a collection, it can be indexed as well, based on the system's indexing specifications. Indexes in this type of system use document granularity addressing. It is desirable to avoid parsing documents in order to resolve a query. Additional parsing can be avoided if the query can be answered definitively from indexes and the access granularity desired by the application is at the document level, as opposed to DOM granularity access.

A clear disadvantage of intact document storage is that for certain applications and queries, it can take a long time and a large amount of memory to process a request. This is mostly due to the need to parse documents to satisfy a query. Optimizations, such as references to offsets within a document, can be made, however, for read-only documents.

The advantages of intact document storage include its simplicity and byte-for-byte round tripping. Berkeley DB XML has an option to store documents intact.

______________________

White Paper
Linux Management with Red Hat Satellite: Measuring Business Impact and ROI

Linux has become a key foundation for supporting today's rapidly growing IT environments. Linux is being used to deploy business applications and databases, trading on its reputation as a low-cost operating environment. For many IT organizations, Linux is a mainstay for deploying Web servers and has evolved from handling basic file, print, and utility workloads to running mission-critical applications and databases, physically, virtually, and in the cloud. As Linux grows in importance in terms of value to the business, managing Linux environments to high standards of service quality — availability, security, and performance — becomes an essential requirement for business success.

Learn More

Sponsored by Red Hat

White Paper
Private PaaS for the Agile Enterprise

If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.

Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.

Learn More

Sponsored by ActiveState