Native XML Data Storage and Retrieval
Some native XML databases, such as Berkeley DB XML, store documents at a granularity finer than the whole document. In such systems, addressability and access granularity are at the subdocument level, while concurrency granularity may or may not be finer than the document level.
Storing documents in pieces offers a number of advantages, including:
- Ability to reference an element or other object within a document directly.
- Ability to retrieve partial documents without parsing.
- Efficient querying, without parsing, by materializing only those parts of a document necessary to evaluate the query.
- Ability to modify a small piece of a large document.
The decision to store documents in pieces results in more choices:
- Degree of round tripping supported, if any.
- What information is stored, that is, the data model of the storage.
- Granularity of addressability.
- Support for partial document modification, without rewriting the entire document.
- Physical format of information.
Fine-grained document storage systems must choose the degree of round tripping to support if returning the original document, byte for byte, is a requirement. Virtually any decomposition of a document for storage results in loss or change of information, such as reordering of attributes or a change in the XML declaration. This is because there is not a 1:1 mapping from the XML infoset to the bytes in a document. That is, there are bytes within an XML document that are not considered relevant to the infoset and, therefore, may not even be passed through by a parser.
To support round tripping, a fine-grained document storage system must track entity references that are expanded during parsing, as well as ignorable white space and namespace prefix mappings. Such mechanisms are unimportant in terms of querying and retrieval of partial documents, but for some applications, they can be critical for document serialization. Because the degree of round tripping implies extra cost, some systems export configuration options to determine handling of these issues.
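The information loss is easy to demonstrate with any off-the-shelf parser. The short Python sketch below uses the standard library's minidom (exact output varies by parser and Python version) to parse a document and serialize it again; the XML declaration, attribute quoting and character references all come back changed, so the bytes no longer match even though the infoset is the same.

```python
import xml.dom.minidom as minidom

original = ("<?xml version='1.0' encoding='utf-8'?>"
            "<doc lang='en'><item>caf&#233;</item></doc>")

reparsed = minidom.parseString(original).toxml()
print(reparsed)
# Typically something like:
#   <?xml version="1.0" ?><doc lang="en"><item>café</item></doc>
# The declaration, the quote style and the character reference have all
# changed, even though the infoset is identical.
print(reparsed == original)   # False
```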
Intact document storage has the vastly simplifying advantage of being unconcerned with the data model of the XML documents it stores. Fine-grained document storage must decide on the data model, which is tied closely to query processing and query language support. For example, XQuery's data model is typed, and type information can appear in XQuery expressions. XPath 1.0 expressions, however, are not richly typed, so no additional type information is necessary.
A simple example of the data model issue is DOM vs. XQuery. The DOM is relatively simple: nearly every object is a node; some nodes have names, some have values, and some have children and siblings. The DOM is essentially a tree with little semantic information, and virtually all of its information is contained in the XML document itself. The XQuery data model, by contrast, is typed. XQuery does support simple, well-formed XML; however, it also supports type information, as obtained from a schema-validated document, where the schema information comes from outside the document.
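A short illustration of how thin the DOM model is, using Python's standard minidom as a stand-in for any DOM implementation:

```python
import xml.dom.minidom as minidom

doc = minidom.parseString('<order id="7"><item>widget</item></order>')

def walk(node, depth=0):
    # Every object is just a node: a node type, a name and maybe a value.
    print("  " * depth, node.nodeType, node.nodeName, repr(node.nodeValue))
    for child in node.childNodes:
        walk(child, depth + 1)

walk(doc.documentElement)
# 1 order None           <- element node
#   1 item None
#     3 #text 'widget'   <- text node
# Nothing here says that id="7" is a number; that kind of type information
# has to come from a schema outside the document, which is where the
# XQuery data model differs.
```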
It is possible to choose a storage data model equivalent to the XML infoset or DOM, but then the powerful type facilities of XPath 2.0 and XQuery 1.0 are not fully available. A schema-validated document has type information available at the time it is parsed and validated. A system where parsing, validation and querying occur at the same time has no problem obtaining type information to satisfy the query. However, in a fine-grained storage system, the parsing and query events are not related. This means that at the time of the query, type information must be found if it is to be used for the query. There are several choices for how a system can implement types:
- Store type information with each document and typed object, and materialize it for querying.
- Store references to the relevant schema files and reload (parse) them for querying.
- Map each type to the nearest atomic type in the XML Schema recommendation and store that information (see the sketch after this list).
- Don't support type information at all, which limits queries and forces them to use their own, complex type definitions.
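As a hypothetical illustration of the third option, the sketch below maps invented application schema types (priceType, skuType) to their nearest built-in atomic types at validation time and stores only the atomic type with the value, so typed comparisons still work at query time without reloading the schema:

```python
from decimal import Decimal

# Invented application schema types mapped to the nearest built-in
# XML Schema atomic types; only the atomic type is stored with the node.
NEAREST_ATOMIC = {
    "priceType": "xs:decimal",   # hypothetical type derived from xs:decimal
    "skuType":   "xs:string",    # hypothetical type derived from xs:string
}
CASTS = {"xs:decimal": Decimal, "xs:string": str}

def store_typed_value(schema_type, lexical_value):
    atomic = NEAREST_ATOMIC.get(schema_type, "xs:string")
    return {"type": atomic, "value": CASTS[atomic](lexical_value)}

node = store_typed_value("priceType", "19.95")
# A typed comparison such as price > 10 can now be answered from what was
# stored, without reparsing the original schema.
assert node["value"] > Decimal("10")
```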
Granularity of addressability is tied closely to the data model. At one extreme is the choice of DOM objects as the addressable unit. This means that each DOM node, be it a document, element or attribute value, is an addressable and separately stored object. Although simple, this approach is quite expensive in terms of memory, disk space and CPU. There are other, coarser-grained solutions. One is to use the element as an addressable unit and associate its attributes and child text nodes. Another is to address elements and text nodes and associate attributes with elements. The former may be better for locality of reference, if an element and its attributes and text nodes are likely to be referenced together.
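The difference is easiest to see as two hypothetical record layouts for the same fragment; neither is meant to depict any particular product's on-disk format:

```python
# Coarser grain: the element <price currency="USD">9.99</price> is one
# addressable record, with its attributes and child text inlined, so a
# single record read returns everything.
element_as_unit = {
    (1, 4): {"name": "price", "attrs": {"currency": "USD"}, "text": "9.99"},
}

# Finer grain: every DOM node is its own addressable record, which is
# simpler and more uniform but costs more lookups, space and CPU.
node_as_unit = {
    (1, 4):    {"kind": "element",   "name": "price"},
    (1, 4, 0): {"kind": "attribute", "name": "currency", "value": "USD"},
    (1, 4, 1): {"kind": "text",      "value": "9.99"},
}
```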
Native XML databases that store documents as fine-grained nodes must assign addressable node identifiers (node IDs) to addressable units. Node IDs are used to retrieve specific nodes during processing. When it comes to physical storage, size matters. Smaller nodes and node IDs mean better locality of reference and fewer disk accesses to read and write data.
Berkeley DB XML stores nodes in a B-tree, where node IDs are allocated in document order, which also is an iteration order on the B-tree. This means that once a node is located, serialization or child navigation can occur by way of iteration rather than by additional lookup operations.
With the appropriate sorting/comparison function, a node ID that is a B-tree key can take on many physical forms. It can be as simple as an integer, or it can be a complex array or string. Node numbering is one of the more interesting and important design choices in a native XML database. Some node numbering schemes allow insertion and removal of arbitrary nodes without renumbering, and allow query-relevant operations to be performed based solely on node numbers and indexes, eliminating node lookups.
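One well-known family of such schemes is Dewey-style numbering, where a node ID is the path of child positions from the root. The sketch below is a generic illustration, not a description of Berkeley DB XML's actual scheme: document order falls out of ordinary ID comparison, and ancestor/descendant tests need only the IDs, never the nodes themselves.

```python
# Each node ID is the path of child positions from the document root.
def is_ancestor(a, b):
    """True if node ID a is a proper ancestor of node ID b."""
    return len(a) < len(b) and b[:len(a)] == a

root      = (1,)
chapter1  = (1, 1)      # first child of the root
section11 = (1, 1, 1)   # first child of chapter1
chapter2  = (1, 2)

assert is_ancestor(root, section11)
assert not is_ancestor(chapter2, section11)

# Document order is plain tuple ordering, so a B-tree keyed on these IDs
# iterates in document order.
assert sorted([chapter2, section11, chapter1, root]) == \
       [root, chapter1, section11, chapter2]
```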
Berkeley DB XML uses a numbering scheme that allows some direct relationship comparisons and attempts to minimize the need to materialize nodes for navigation. The scheme also avoids renumbering when a document is modified partially.
One advantage of fine-grained storage is the ability to modify some parts of documents without touching the rest. There is a significant performance and scalability benefit in such “surgical” changes; however, it can be difficult to do efficiently. Many systems do not support partial modification of documents, and if they do, it is only through a well-defined interface such as XUpdate, as opposed to a direct DOM manipulation.
A partial modification can render a document invalid, or worse, malformed. Re-parsing for validation, however, negates much of the benefit of partial modifications. Insertion or removal of an addressable object, such as an element, affects the system's node numbering scheme, as described above. Indexes also are affected and must be updated. A database may choose to revalidate or parse after a modification or allow the application to request it explicitly.
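A hypothetical sketch of what a "surgical" change involves, using a plain dictionary as a stand-in for the node store and a single value index: only the affected node record and the index entries that reference it are touched, and the rest of the document is never read.

```python
# A dict stands in for the node store; node IDs follow the Dewey-style
# sketch above.
nodes = {
    (1,):      {"kind": "element", "name": "doc"},
    (1, 1):    {"kind": "element", "name": "price"},
    (1, 1, 1): {"kind": "text",    "value": "9.99"},
}
value_index = {"9.99": {(1, 1, 1)}}   # value -> set of node IDs

def update_text(node_id, new_value):
    old = nodes[node_id]["value"]
    nodes[node_id]["value"] = new_value              # rewrite one record only
    value_index[old].discard(node_id)                # keep the index consistent
    value_index.setdefault(new_value, set()).add(node_id)

update_text((1, 1, 1), "12.50")
assert nodes[(1, 1, 1)]["value"] == "12.50"
```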
Fine-grained document storage has a disadvantage in serialization of an entire document. In this situation, an iterator must traverse the addressable pieces of the document. If this is a common operation, it may be worth optimizing or caching the serialized document for reuse, which creates a possible concurrency problem. Document serialization can be optimized by maintaining addressable units in document order, keeping names in stored nodes rather than name IDs and using coarser granularity, which leads to fewer objects retrieved from disk.
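For illustration, the sketch below serializes a whole document by scanning hypothetical node records in document order with a stack of open elements, rather than looking each node up individually:

```python
nodes = {
    (1,):      {"kind": "element", "name": "doc"},
    (1, 1):    {"kind": "element", "name": "item"},
    (1, 1, 1): {"kind": "text",    "value": "widget"},
    (1, 2):    {"kind": "element", "name": "note"},
    (1, 2, 1): {"kind": "text",    "value": "rush order"},
}

def serialize(nodes):
    out, stack = [], []                    # stack of (node ID, element name)
    for node_id in sorted(nodes):          # node IDs sort in document order
        while stack and node_id[:len(stack[-1][0])] != stack[-1][0]:
            out.append(f"</{stack.pop()[1]}>")   # close finished elements
        rec = nodes[node_id]
        if rec["kind"] == "element":
            out.append(f"<{rec['name']}>")
            stack.append((node_id, rec["name"]))
        else:
            out.append(rec["value"])
    while stack:
        out.append(f"</{stack.pop()[1]}>")
    return "".join(out)

print(serialize(nodes))   # <doc><item>widget</item><note>rush order</note></doc>
```

Because the node IDs sort in document order, the traversal is a single ordered scan of the store, which is one reason keeping addressable units in document order pays off.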