Native XML Data Storage and Retrieval
The design and implementation trade-offs within a native XML database make a significant impact on the performance, scalability and features available to applications that use it. This article focuses on the granularity of stored XML documents and indexing as two of the most critical design considerations. Berkeley DB XML from Sleepycat Software (www.sleepycat.com/products/xml.shtml) is the basis for this discussion.
The basic functions of an XML database are to store documents, query over documents and handle query results. Of course, indexes are required to obtain acceptable query performance.
In a relational database, pieces of a relational table are stored, queries are SQL and results are tabular. This abstraction and standardization is useful from an application developer's perspective. Developers have less visibility into precisely how documents are stored and indexed and how a query can leverage the combination of storage format, indexes and query language to answer a question quickly.
The same concepts exist in a native XML database, such as Berkeley DB XML. In this case, the data is the XML document and the query may be an XPath or XQuery expression. The results may be XML documents, DOM, SAX or a proprietary form. Within a native XML database, mechanisms for storage, indexing and querying are not obvious from the perspective of an application developer, yet they are critical to the function, performance and scalability of the overall system.
A native XML database exposes a logical model of storing and retrieving XML documents; however, its internal storage model may not be equivalent to the document. Indexing is a crucial component of any database. Without intelligent indexing, a database is little better than a filesystem for information retrieval. Query processing builds on both storage format and indexes but is beyond the scope of this article.
Most native XML databases are oriented toward storing XML documents, where a key issue is the granularity with which the document is stored. In database terms, granularity can be described in several different ways: external access, internal addressability and concurrency.
A distinction is made between access granularity and addressability. Addressability refers to objects that can be named and accessed directly, without navigation, within the system. Access may be provided through a DOM to a system with an addressable granularity of an XML document, by parsing the document. In this sense, access granularity is user-visible, while addressability is an internal concept. Concurrency means how objects can be modified concurrently, if such a feature is supported.
There are two major choices in terms of how to store a document—intact or not intact. Systems that store XML documents intact usually parse the XML in order to ensure it is well formed and valid but otherwise store documents unchanged. This is useful for applications that require retrieval of the entire byte-for-byte document or for round tripping. Furthermore, for relatively small documents that tend to be retrieved and processed whole, such a system is ideal. The major issue for intact document storage is how to address target documents within a collection of documents. There are two primary mechanisms to do this: a unique identifier, such as name or document ID, or a query expression, such as XQuery. The first results in exactly one document, whereas the latter may return many documents in a result set.
For a large collection, it must be possible to target a small set of result documents in a query. For intact document storage, this implies an indexing mechanism. If a document is parsed upon insertion into a collection, it can be indexed as well, based on the system's indexing specifications. Indexes in this type of system use document granularity addressing. It is desirable to avoid parsing documents in order to resolve a query. Additional parsing can be avoided if the query can be answered definitively from indexes and the access granularity desired by the application is at the document level, as opposed to DOM granularity access.
A clear disadvantage of intact document storage is that for certain applications and queries, it can take a long time and a large amount of memory to process a request. This is mostly due to the need to parse documents to satisfy a query. Optimizations, such as references to offsets within a document, can be made, however, for read-only documents.
The advantages of intact document storage include its simplicity and byte-for-byte round tripping. Berkeley DB XML has an option to store documents intact.
Some native XML databases, such as Berkeley DB XML, store documents with granularity finer than the document. The properties of such systems include: addressability is subdocument level, access granularity is subdocument level and concurrency granularity may or may not be finer than document level.
Storing documents in pieces offers a number of advantages, including:
Ability to reference an element or other object within a document directly.
Ability to retrieve partial documents without parsing.
Efficient querying, without parsing, by materializing only those parts of a document necessary to evaluate the query.
Ability to modify a small piece of a large document.
The decision to store documents in pieces results in more choices:
Degree of round tripping supported, if any.
What information is stored or the data model of the storage.
Granularity of addressability.
Support for partial document modification, without rewriting the entire document.
Physical format of information.
Fine-grained document storage systems must choose the degree of round tripping supported if it is a requirement to be able to return the original document, byte for byte. Virtually any decomposition of a document for storage results in loss or change of information, such as reordering of attributes, or a change in the XML declaration. This is because there is not a 1:1 mapping from XML infoset to bytes in a document. That is, there are bytes within an XML document that are not considered relevant to the infoset and, therefore, may not even be passed through by a parser.
To support round tripping, a fine-grained document storage system must track entity references that are expanded during parsing, as well as ignorable white space and namespace prefix mappings. Such mechanisms are unimportant in terms of querying and retrieval of partial documents, but for some applications, they can be critical for document serialization. Because the degree of round tripping implies extra cost, some systems export configuration options to determine handling of these issues.
Intact document storage has the vastly simplifying advantage of being unconcerned with the data model of the XML documents it stores. Fine-grained document storage must decide on the data model, which is tied closely to query processing and query language support. For example, XQuery's data model is typed, and type information can appear in XQuery expressions. XPath 1.0 expressions, however, are not richly typed, so no additional type information is necessary.
A simple example of the data model issue is DOM vs. XQuery. The DOM is relatively simple. Where most every object is a node, some nodes have names, some have values and some have children and siblings. The DOM essentially is a tree with little semantic information, and virtually all of its information is contained in the XML document itself. Conversely, the XQuery data model is typed. XQuery does support simple, well-formed XML; however, it also supports type information, as obtained from a schema-validated document, where the schema information comes from outside the document.
It is possible to choose a storage data model equivalent to the XML infoset or DOM, but then the powerful type facilities of XPath 2.0 and XQuery 1.0 are not fully available. A schema-validated document has type information available at the time it is parsed and validated. A system where parsing, validation and querying occur at the same time has no problem obtaining type information to satisfy the query. However, in a fine-grained storage system, the parsing and query events are not related. This means that at the time of the query, type information must be found if it is to be used for the query. There are several choices for how a system can implement types:
Store type information with each document and typed object and materialize it for querying.
Store references to relevant schema files and reload (parse) them for querying.
Map each type to the nearest atomic type in the XML Schema recommendation and store that information.
Don't support type information at all, which limits queries and forces them to use their own, complex type definitions.
Granularity of addressability is tied closely to the data model. At one extreme is the choice of DOM objects as the addressable unit. This means that each DOM node, be it a document, element or attribute value, is an addressable and separately stored object. Although simple, this approach is quite expensive in terms of memory, disk space and CPU. There are other, coarser-grained solutions. One is to use the element as an addressable unit and associate its attributes and child text nodes. Another is to address elements and text nodes and associate attributes with elements. The former may be better for locality of reference, if an element and its attributes and text nodes are likely to be referenced together.
Native XML databases that store documents as fine-grained nodes must assign addressable node identifiers (node IDs) to addressable units. Node IDs are used to retrieve specific nodes during processing. When it comes to physical storage, size matters. Smaller nodes and node IDs mean better locality of reference and fewer disk accesses to read and write data.
Berkeley DB XML stores nodes in a B-tree, where node IDs are allocated in document order, which also is an iteration order on the B-tree. This means that once a node is located, serialization or child navigation can occur by way of iteration rather than by additional lookup operations.
With the appropriate sorting/comparison function, a node ID that is a B-tree key can take on many physical forms. It can be as simple as an integer, or it can be a complex array or string. Node numbering is one of the more interesting and important design choices in a native XML database. There are node numbering schemes that have the ability to allow insertion and removal of arbitrary nodes without renumbering and to allow query-relevant operations to be performed based solely on node numbers and indexes, eliminating node lookups.
Berkeley DB XML uses a numbering scheme that allows some direct relationship comparisons and attempts to minimize the need to materialize nodes for navigation. The scheme also avoids renumbering when a document is modified partially.
One advantage of fine-grained storage is the ability to modify some parts of documents without touching the rest. There is a significant performance and scalability benefit in such “surgical” changes; however, it can be difficult to do efficiently. Many systems do not support partial modification of documents, and if they do, it is only through a well-defined interface such as XUpdate, as opposed to a direct DOM manipulation.
A partial modification can render a document invalid, or worse, malformed. Re-parsing for validation, however, negates much of the benefit of partial modifications. Insertion or removal of an addressable object, such as an element, affects the system's node numbering scheme, as described above. Indexes also are affected and must be updated. A database may choose to revalidate or parse after a modification or allow the application to request it explicitly.
Fine-grained document storage has a disadvantage in serialization of an entire document. In this situation, an iterator must traverse the addressable pieces of the document. If this is a common operation, it may be worth optimizing or caching the serialized document for reuse, which creates a possible concurrency problem. Document serialization can be optimized by maintaining addressable units in document order, keeping names in stored nodes rather than name IDs and using coarser granularity, which leads to fewer objects retrieved from disk.
Proper specification and use of indexes can increase query speeds by orders of magnitude. However, indexes consume space on disk and in the cache—a classic space versus speed trade-off. Under certain situations, the presence of indexes slows operations. When frequently updating indexed data, time spent re-indexing can offset the benefit of indexed access.
The data models for querying XML imply that virtually all indexes deal with elements, attributes and their respective text content, as well as possible data types represented by their value strings. However, there is no standard or convention regarding how to specify indexes or even what is indexed and how. Different XML databases have made different choices regarding indexes in these areas:
Index Type—structure, value, full-text.
Index Scope—document, collection.
Index Target—document, node.
Index Control—automatic, voluntary, required.
Structural indexes are used for tracking structure and path information, such as “track existence of all element nodes with the path /a/b/c” or “track all paths to the node c.” Such indexes are useful for navigational portions of queries. Some indexes reduce the result set to a smaller set of possible results, rather than give a single definitive result. For example, the index above that tracks all paths /a/b/c can be positive about its answer to the query /a/b/c. The index that tracks all paths to c cannot be definite, because it also contains entries for paths such as /e/f/c.
Value indexes are used to track all values for specific elements or attributes. A value index on the element “color” would have an index entry for every separate instance of color and would be useful for a query such as //color[.='green']. In addition, value indexes may be typed so that comparisons can be performed correctly. The typed data model of XPath 2.0 and XQuery 1.0 brings a long list of potential data types from the XML Schema recommendation, such as xs:date, xs:time and various numeric formats. Support for typed indexes allows applications to use them directly rather than modify their content to map, for example, xs:datetime to integer, so that range-based comparisons can be used.
Full-text indexing is a large topic unto itself. There is a working draft for full-text extensions to XQuery, but it is not yet in general use. Some native XML database products implement what they call full-text indexing, which minimally is a word index over a document. Because there is no standard, a full-text index requires a proprietaryy query language or extension as an interface.
Most native XML databases store documents in a collection. The scope of a given index could be collection-wide or it could be restricted to a single document. A native XML database system can choose the index scope it implements. Queries against a collection can return documents or sets of nodes within documents. In order to support efficient restriction of a query to a manageable set of documents, the system must support indexes at the collection scope. This does not mean that it is not also possible to have indexes at the document scope, which contain entries that apply only to a given document.
Related to scope is the target or the object referenced by an index entry. It can be a document or an object within a document. An index is capable of pointing down to the addressable unit in the system, but such granularity is not always necessary and can be expensive. Because navigational operations within a document stored with fine granularity are not as expensive as those used for intact document storage, due to parsing, it can be sufficient to return the document element for further navigation. Although this is possible, it is the case that most database systems with fine-grained document storage reference directly to nodes in indexes rather than to the containing documents.
Another dimension of index type is how indexes are specified. Voluntary indexes are specified explicitly by an interface to the system. These indexes allow for some experimentation to find the minimal useful set of indexes. Some systems have automatic indexes, where a well-defined set of indexes always is created, except for those that are disabled explicitly, by way of configuration or interface. The system also may have required indexes, which cannot be disabled because they are necessary for proper functioning of the system.
George Feinberg is the architect for Sleepycat Software's Berkeley DB XML. Prior to that, Feinberg was one of the architects of the eXcelon native XML database, now called XML Information Server (XIS) and owned by Progress Software. He was eXcelon's representative to the W3C and the XML Schema working group. Feinberg's previous experience includes serving as an operating system designer and developer for the Open Software Foundation (now The Open Group), Hewlett-Packard and a storage system startup.