Native XML Data Storage and Retrieval
The design and implementation trade-offs within a native XML database make a significant impact on the performance, scalability and features available to applications that use it. This article focuses on the granularity of stored XML documents and indexing as two of the most critical design considerations. Berkeley DB XML from Sleepycat Software (www.sleepycat.com/products/xml.shtml) is the basis for this discussion.
The basic functions of an XML database are to store documents, query over documents and handle query results. Of course, indexes are required to obtain acceptable query performance.
In a relational database, pieces of a relational table are stored, queries are SQL and results are tabular. This abstraction and standardization is useful from an application developer's perspective. Developers have less visibility into precisely how documents are stored and indexed and how a query can leverage the combination of storage format, indexes and query language to answer a question quickly.
The same concepts exist in a native XML database, such as Berkeley DB XML. In this case, the data is the XML document and the query may be an XPath or XQuery expression. The results may be XML documents, DOM, SAX or a proprietary form. Within a native XML database, mechanisms for storage, indexing and querying are not obvious from the perspective of an application developer, yet they are critical to the function, performance and scalability of the overall system.
A native XML database exposes a logical model of storing and retrieving XML documents; however, its internal storage model may not be equivalent to the document. Indexing is a crucial component of any database. Without intelligent indexing, a database is little better than a filesystem for information retrieval. Query processing builds on both storage format and indexes but is beyond the scope of this article.
Most native XML databases are oriented toward storing XML documents, where a key issue is the granularity with which the document is stored. In database terms, granularity can be described in several different ways: external access, internal addressability and concurrency.
A distinction is made between access granularity and addressability. Addressability refers to objects that can be named and accessed directly, without navigation, within the system. Access may be provided through a DOM to a system with an addressable granularity of an XML document, by parsing the document. In this sense, access granularity is user-visible, while addressability is an internal concept. Concurrency means how objects can be modified concurrently, if such a feature is supported.
There are two major choices in terms of how to store a document—intact or not intact. Systems that store XML documents intact usually parse the XML in order to ensure it is well formed and valid but otherwise store documents unchanged. This is useful for applications that require retrieval of the entire byte-for-byte document or for round tripping. Furthermore, for relatively small documents that tend to be retrieved and processed whole, such a system is ideal. The major issue for intact document storage is how to address target documents within a collection of documents. There are two primary mechanisms to do this: a unique identifier, such as name or document ID, or a query expression, such as XQuery. The first results in exactly one document, whereas the latter may return many documents in a result set.
For a large collection, it must be possible to target a small set of result documents in a query. For intact document storage, this implies an indexing mechanism. If a document is parsed upon insertion into a collection, it can be indexed as well, based on the system's indexing specifications. Indexes in this type of system use document granularity addressing. It is desirable to avoid parsing documents in order to resolve a query. Additional parsing can be avoided if the query can be answered definitively from indexes and the access granularity desired by the application is at the document level, as opposed to DOM granularity access.
A clear disadvantage of intact document storage is that for certain applications and queries, it can take a long time and a large amount of memory to process a request. This is mostly due to the need to parse documents to satisfy a query. Optimizations, such as references to offsets within a document, can be made, however, for read-only documents.
The advantages of intact document storage include its simplicity and byte-for-byte round tripping. Berkeley DB XML has an option to store documents intact.
Fast/Flexible Linux OS Recovery
On Demand Now
In this live one-hour webinar, learn how to enhance your existing backup strategies for complete disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible full-system recovery solution for UNIX and Linux systems.
Join Linux Journal's Shawn Powers and David Huffman, President/CEO, Storix, Inc.
Free to Linux Journal readers.Register Now!
- Download "Linux Management with Red Hat Satellite: Measuring Business Impact and ROI"
- Profiles and RC Files
- Understanding Ceph and Its Place in the Market
- Astronomy for KDE
- The Giant Zero, Part 0.x
- Maru OS Brings Debian to Your Phone
- Git 2.9 Released
- OpenSwitch Finds a New Home
- SoftMaker FreeOffice
- What's Our Next Fight?
With all the industry talk about the benefits of Linux on Power and all the performance advantages offered by its open architecture, you may be considering a move in that direction. If you are thinking about analytics, big data and cloud computing, you would be right to evaluate Power. The idea of using commodity x86 hardware and replacing it every three years is an outdated cost model. It doesn’t consider the total cost of ownership, and it doesn’t consider the advantage of real processing power, high-availability and multithreading like a demon.
This ebook takes a look at some of the practical applications of the Linux on Power platform and ways you might bring all the performance power of this open architecture to bear for your organization. There are no smoke and mirrors here—just hard, cold, empirical evidence provided by independent sources. I also consider some innovative ways Linux on Power will be used in the future.Get the Guide