Native XML Data Storage and Retrieval
The design and implementation trade-offs within a native XML database make a significant impact on the performance, scalability and features available to applications that use it. This article focuses on the granularity of stored XML documents and indexing as two of the most critical design considerations. Berkeley DB XML from Sleepycat Software (www.sleepycat.com/products/xml.shtml) is the basis for this discussion.
The basic functions of an XML database are to store documents, query over documents and handle query results. Of course, indexes are required to obtain acceptable query performance.
In a relational database, pieces of a relational table are stored, queries are SQL and results are tabular. This abstraction and standardization is useful from an application developer's perspective. Developers have less visibility into precisely how documents are stored and indexed and how a query can leverage the combination of storage format, indexes and query language to answer a question quickly.
The same concepts exist in a native XML database, such as Berkeley DB XML. In this case, the data is the XML document and the query may be an XPath or XQuery expression. The results may be XML documents, DOM, SAX or a proprietary form. Within a native XML database, mechanisms for storage, indexing and querying are not obvious from the perspective of an application developer, yet they are critical to the function, performance and scalability of the overall system.
A native XML database exposes a logical model of storing and retrieving XML documents; however, its internal storage model may not be equivalent to the document. Indexing is a crucial component of any database. Without intelligent indexing, a database is little better than a filesystem for information retrieval. Query processing builds on both storage format and indexes but is beyond the scope of this article.
Most native XML databases are oriented toward storing XML documents, where a key issue is the granularity with which the document is stored. In database terms, granularity can be described in several different ways: external access, internal addressability and concurrency.
A distinction is made between access granularity and addressability. Addressability refers to objects that can be named and accessed directly, without navigation, within the system. Access may be provided through a DOM to a system with an addressable granularity of an XML document, by parsing the document. In this sense, access granularity is user-visible, while addressability is an internal concept. Concurrency means how objects can be modified concurrently, if such a feature is supported.
There are two major choices in terms of how to store a document—intact or not intact. Systems that store XML documents intact usually parse the XML in order to ensure it is well formed and valid but otherwise store documents unchanged. This is useful for applications that require retrieval of the entire byte-for-byte document or for round tripping. Furthermore, for relatively small documents that tend to be retrieved and processed whole, such a system is ideal. The major issue for intact document storage is how to address target documents within a collection of documents. There are two primary mechanisms to do this: a unique identifier, such as name or document ID, or a query expression, such as XQuery. The first results in exactly one document, whereas the latter may return many documents in a result set.
For a large collection, it must be possible to target a small set of result documents in a query. For intact document storage, this implies an indexing mechanism. If a document is parsed upon insertion into a collection, it can be indexed as well, based on the system's indexing specifications. Indexes in this type of system use document granularity addressing. It is desirable to avoid parsing documents in order to resolve a query. Additional parsing can be avoided if the query can be answered definitively from indexes and the access granularity desired by the application is at the document level, as opposed to DOM granularity access.
A clear disadvantage of intact document storage is that for certain applications and queries, it can take a long time and a large amount of memory to process a request. This is mostly due to the need to parse documents to satisfy a query. Optimizations, such as references to offsets within a document, can be made, however, for read-only documents.
The advantages of intact document storage include its simplicity and byte-for-byte round tripping. Berkeley DB XML has an option to store documents intact.
|Free Today: September Issue of Linux Journal (Retail value: $5.99)||Sep 27, 2016|
|nginx||Sep 27, 2016|
|Epiq Solutions' Sidekiq M.2||Sep 26, 2016|
|Nativ Disc||Sep 23, 2016|
|Android Browser Security--What You Haven't Been Told||Sep 22, 2016|
|The Many Paths to a Solution||Sep 21, 2016|
- Free Today: September Issue of Linux Journal (Retail value: $5.99)
- Android Browser Security--What You Haven't Been Told
- Readers' Choice Awards 2013
- Epiq Solutions' Sidekiq M.2
- Nativ Disc
- The Many Paths to a Solution
- Download "Linux Management with Red Hat Satellite: Measuring Business Impact and ROI"
- Synopsys' Coverity
- Tech Tip: Really Simple HTTP Server with Python
Pick up any e-commerce web or mobile app today, and you’ll be holding a mashup of interconnected applications and services from a variety of different providers. For instance, when you connect to Amazon’s e-commerce app, cookies, tags and pixels that are monitored by solutions like Exact Target, BazaarVoice, Bing, Shopzilla, Liveramp and Google Tag Manager track every action you take. You’re presented with special offers and coupons based on your viewing and buying patterns. If you find something you want for your birthday, a third party manages your wish list, which you can share through multiple social- media outlets or email to a friend. When you select something to buy, you find yourself presented with similar items as kind suggestions. And when you finally check out, you’re offered the ability to pay with promo codes, gifts cards, PayPal or a variety of credit cards.Get the Guide