Native XML Data Storage and Retrieval

A new generation of databases creates a new set of decisions and several full-featured ways to build queries.

Fine Granularity Storage

Some native XML databases, such as Berkeley DB XML, store documents with granularity finer than the document. In such systems, both addressability and access granularity are at the subdocument level, while concurrency granularity may or may not be finer than the document level.

Storing documents in pieces offers a number of advantages, including:

  • Ability to reference an element or other object within a document directly.

  • Ability to retrieve partial documents without parsing.

  • Efficient querying, without parsing, by materializing only those parts of a document necessary to evaluate the query.

  • Ability to modify a small piece of a large document.

The decision to store documents in pieces results in more choices:

  • Degree of round tripping supported, if any.

  • What information is stored or the data model of the storage.

  • Granularity of addressability.

  • Support for partial document modification, without rewriting the entire document.

  • Physical format of information.

When an application must be able to get back the original document, byte for byte, a fine-grained document storage system must decide what degree of round tripping to support. Virtually any decomposition of a document for storage results in loss or change of information, such as reordering of attributes or a change in the XML declaration. This is because there is no 1:1 mapping from the XML infoset to the bytes in a document. That is, some bytes within an XML document are not considered relevant to the infoset and, therefore, may not even be passed through by a parser.

To support round tripping, a fine-grained document storage system must track entity references that are expanded during parsing, as well as ignorable white space and namespace prefix mappings. Such mechanisms are unimportant in terms of querying and retrieval of partial documents, but for some applications, they can be critical for document serialization. Because the degree of round tripping implies extra cost, some systems export configuration options to determine handling of these issues.
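
The information loss described above can be seen with any parser that keeps only infoset-level data. The following is a minimal sketch using Python's standard xml.etree.ElementTree module as a stand-in for a storage layer; the sample document and variable names are invented for illustration and do not come from any particular database.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: single-quoted attributes, extra white space inside
# the start tag, an XML declaration and a character reference.
original = (
    "<?xml version='1.0' encoding='UTF-8'?>\n"
    "<note  id = '7'   lang=\"en\">Tom &#38; Jerry</note>"
)

# Parse (decompose) and then serialize (reassemble) the document.
root = ET.fromstring(original.encode("utf-8"))
serialized = ET.tostring(root, encoding="unicode")

# The bytes differ, because quoting style, tag-internal white space and the
# declaration were never part of the parsed tree.
bytes_preserved = serialized == original

# The content that queries care about survives the round trip.
reparsed = ET.fromstring(serialized)
infoset_preserved = (
    reparsed.tag == root.tag
    and reparsed.attrib == root.attrib
    and reparsed.text == root.text
)

print(serialized)
print("bytes preserved:", bytes_preserved)
print("infoset preserved:", infoset_preserved)
```

A system that promises byte-for-byte round tripping would have to store the discarded details alongside the parsed nodes, which is the extra cost mentioned above.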

Data Model

Intact document storage has the vastly simplifying advantage of being unconcerned with the data model of the XML documents it stores. Fine-grained document storage must decide on the data model, which is tied closely to query processing and query language support. For example, XQuery's data model is typed, and type information can appear in XQuery expressions. XPath 1.0 expressions, however, are not richly typed, so no additional type information is necessary.

A simple example of the data model issue is DOM vs. XQuery. The DOM is relatively simple: almost every object is a node; some nodes have names, some have values and some have children and siblings. The DOM essentially is a tree with little semantic information, and virtually all of its information is contained in the XML document itself. Conversely, the XQuery data model is typed. XQuery does support simple, well-formed XML; however, it also supports type information, as obtained from a schema-validated document, where the schema information comes from outside the document.

It is possible to choose a storage data model equivalent to the XML infoset or DOM, but then the powerful type facilities of XPath 2.0 and XQuery 1.0 are not fully available. A schema-validated document has type information available at the time it is parsed and validated. A system where parsing, validation and querying occur at the same time has no problem obtaining type information to satisfy the query. However, in a fine-grained storage system, the parsing and query events are not related. This means that at the time of the query, type information must be found if it is to be used for the query. There are several choices for how a system can implement types:

  • Store type information with each document and typed object and materialize it for querying.

  • Store references to relevant schema files and reload (parse) them for querying.

  • Map each type to the nearest atomic type in the XML Schema recommendation and store that information.

  • Don't support type information at all, which limits queries and forces them to use their own, complex type definitions.
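
As a rough illustration of the third choice, the sketch below stores an atomic type name next to each value and applies it at query time. The store layout, the CASTS table and the typed_value helper are all hypothetical, and falling back to string comparison for untyped values is a simplification made for this sketch rather than the behavior of any query language.

```python
from decimal import Decimal

# Hypothetical mapping from stored atomic type names to Python conversions.
CASTS = {
    "xs:string": str,
    "xs:integer": int,
    "xs:decimal": Decimal,
}

# Hypothetical store: node ID -> (lexical value, atomic type name or None).
stored = {
    1: ("9", "xs:integer"),
    2: ("10", "xs:integer"),
    3: ("9", None),   # same text, but no type information was kept
    4: ("10", None),
}

def typed_value(node_id):
    """Materialize a value using the type recorded when it was stored."""
    lexical, type_name = stored[node_id]
    cast = CASTS.get(type_name, str)  # untyped values stay strings here
    return cast(lexical)

# With stored types, 9 < 10 is evaluated as an integer comparison.
typed_less = typed_value(1) < typed_value(2)

# Without them, this sketch compares the strings "9" and "10".
untyped_less = typed_value(3) < typed_value(4)

print(typed_less, untyped_less)
```

The point is only that the type must be available at query time, long after parsing and validation have finished.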

Granularity of addressability is tied closely to the data model. At one extreme is the choice of DOM objects as the addressable unit. This means that each DOM node, be it a document, element or attribute value, is an addressable and separately stored object. Although simple, this approach is quite expensive in terms of memory, disk space and CPU. There are other, coarser-grained solutions. One is to use the element as an addressable unit and associate its attributes and child text nodes. Another is to address elements and text nodes and associate attributes with elements. The former may be better for locality of reference, if an element and its attributes and text nodes are likely to be referenced together.
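
To make the trade-off concrete, the following sketch counts how many separately addressable units one small document produces under each of the three schemes described above. The sample document is invented, and the count for the DOM-level scheme leaves out the document node itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
doc = ET.fromstring(
    '<order id="42" status="open">'
    '<item sku="a1">Widget</item>'
    '<item sku="b2">Gadget</item>'
    '</order>'
)

elements = list(doc.iter())
element_count = len(elements)
attribute_count = sum(len(e.attrib) for e in elements)
# ElementTree keeps text in .text and .tail; count the non-empty ones.
text_count = sum(1 for e in elements for t in (e.text, e.tail) if t)

# Scheme 1: every element, attribute and text node is addressable.
dom_level_units = element_count + attribute_count + text_count

# Scheme 2: only elements are addressable; attributes and child text
# are stored with their element.
element_units = element_count

# Scheme 3: elements and text nodes are addressable; attributes are
# stored with their element.
element_and_text_units = element_count + text_count

print(dom_level_units, element_units, element_and_text_units)
```

Fewer addressable units mean fewer stored objects and identifiers, at the cost of coarser direct access.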

Native XML databases that store documents as fine-grained nodes must assign addressable node identifiers (node IDs) to addressable units. Node IDs are used to retrieve specific nodes during processing. When it comes to physical storage, size matters. Smaller nodes and node IDs mean better locality of reference and fewer disk accesses to read and write data.

Berkeley DB XML stores nodes in a B-tree, where node IDs are allocated in document order, which also is an iteration order on the B-tree. This means that once a node is located, serialization or child navigation can occur by way of iteration rather than by additional lookup operations.
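
The benefit of keeping node IDs in document order can be sketched with a sorted list standing in for the B-tree. The record layout here, including the "last descendant" field, is an assumption made for illustration and is not Berkeley DB XML's actual storage format.

```python
import bisect

# Records sorted by key (doc_id, node_number); node numbers follow
# document order. Each value is (label, number of the last descendant).
records = [
    ((1, 1), ("order", 5)),
    ((1, 2), ("item", 3)),
    ((1, 3), ("#text Widget", 3)),
    ((1, 4), ("item", 5)),
    ((1, 5), ("#text Gadget", 5)),
    ((2, 1), ("invoice", 1)),
]
keys = [key for key, _ in records]

def subtree(doc_id, node_number):
    """Do one lookup to find the node, then iterate over its subtree."""
    start = bisect.bisect_left(keys, (doc_id, node_number))
    if start == len(keys) or keys[start] != (doc_id, node_number):
        raise KeyError((doc_id, node_number))
    last = records[start][1][1]
    labels = []
    for key, (label, _) in records[start:]:
        # Stop at the end of the subtree or at the next document.
        if key[0] != doc_id or key[1] > last:
            break
        labels.append(label)
    return labels

first_item = subtree(1, 2)
whole_document = subtree(1, 1)
print(first_item)
print(whole_document)
```

Only the initial bisect is a search; everything after it is a sequential scan over adjacent records, which is what an ordered B-tree cursor provides.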

With the appropriate sorting/comparison function, a node ID that is a B-tree key can take on many physical forms. It can be as simple as an integer, or it can be a complex array or string. Node numbering is one of the more interesting and important design choices in a native XML database. There are node numbering schemes that have the ability to allow insertion and removal of arbitrary nodes without renumbering and to allow query-relevant operations to be performed based solely on node numbers and indexes, eliminating node lookups.
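
One family of schemes with these properties labels each node with the path of sibling positions leading to it. The sketch below is a simplified illustration of that idea using tuples of fractions; it is not the scheme used by Berkeley DB XML or any other specific product, and the helper names are invented.

```python
from fractions import Fraction

def is_ancestor(a, b):
    """True if node a is a proper ancestor of node b (a is a prefix of b)."""
    return len(a) < len(b) and b[:len(a)] == a

def is_parent(a, b):
    """True if node a is the parent of node b."""
    return len(b) == len(a) + 1 and b[:len(a)] == a

def between(left, right):
    """ID for a new sibling between two siblings; neither is renumbered."""
    if left[:-1] != right[:-1] or not left[-1] < right[-1]:
        raise ValueError("expected two ordered siblings")
    return left[:-1] + ((left[-1] + right[-1]) / 2,)

root = (Fraction(1),)
first = root + (Fraction(1),)
second = root + (Fraction(2),)
text_in_first = first + (Fraction(1),)

# Insert a new element between the two existing children of the root.
inserted = between(first, second)

# Plain tuple comparison yields document order.
order = sorted([second, inserted, root, text_in_first, first])

print(order == [root, first, text_in_first, inserted, second])
print(is_ancestor(root, text_in_first), is_parent(root, inserted))
```

Relationship tests such as ancestor or parent need only the two IDs, so no stored node has to be fetched to answer them. A real system would also have to bound the size of the IDs, which this sketch ignores.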

Berkeley DB XML uses a numbering scheme that allows some direct relationship comparisons and attempts to minimize the need to materialize nodes for navigation. The scheme also avoids renumbering when a document is modified partially.

One advantage of fine-grained storage is the ability to modify some parts of documents without touching the rest. Such “surgical” changes offer a significant performance and scalability benefit; however, they can be difficult to implement efficiently. Many systems do not support partial modification of documents, and those that do often allow it only through a well-defined interface such as XUpdate, as opposed to direct DOM manipulation.

A partial modification can render a document invalid, or worse, malformed. Re-parsing for validation, however, negates much of the benefit of partial modifications. Insertion or removal of an addressable object, such as an element, affects the system's node numbering scheme, as described above. Indexes also are affected and must be updated. A database may choose to revalidate or parse after a modification or allow the application to request it explicitly.

Fine-grained document storage is at a disadvantage when an entire document must be serialized. In this situation, an iterator must traverse the addressable pieces of the document. If this is a common operation, it may be worth optimizing or caching the serialized document for reuse, although a cache creates a possible concurrency problem. Document serialization can be optimized by maintaining addressable units in document order, storing names directly in nodes rather than as name IDs and using coarser granularity, which leads to fewer objects retrieved from disk.
