Native XML Data Storage and Retrieval
Some native XML databases, such as Berkeley DB XML, store documents with granularity finer than the document. The properties of such systems include: addressability is subdocument level, access granularity is subdocument level and concurrency granularity may or may not be finer than document level.
Storing documents in pieces offers a number of advantages, including:
Ability to reference an element or other object within a document directly.
Ability to retrieve partial documents without parsing.
Efficient querying, without parsing, by materializing only those parts of a document necessary to evaluate the query.
Ability to modify a small piece of a large document.
The decision to store documents in pieces results in more choices:
Degree of round tripping supported, if any.
What information is stored or the data model of the storage.
Granularity of addressability.
Support for partial document modification, without rewriting the entire document.
Physical format of information.
Fine-grained document storage systems must choose the degree of round tripping supported if it is a requirement to be able to return the original document, byte for byte. Virtually any decomposition of a document for storage results in loss or change of information, such as reordering of attributes, or a change in the XML declaration. This is because there is not a 1:1 mapping from XML infoset to bytes in a document. That is, there are bytes within an XML document that are not considered relevant to the infoset and, therefore, may not even be passed through by a parser.
To support round tripping, a fine-grained document storage system must track entity references that are expanded during parsing, as well as ignorable white space and namespace prefix mappings. Such mechanisms are unimportant in terms of querying and retrieval of partial documents, but for some applications, they can be critical for document serialization. Because the degree of round tripping implies extra cost, some systems export configuration options to determine handling of these issues.
Intact document storage has the vastly simplifying advantage of being unconcerned with the data model of the XML documents it stores. Fine-grained document storage must decide on the data model, which is tied closely to query processing and query language support. For example, XQuery's data model is typed, and type information can appear in XQuery expressions. XPath 1.0 expressions, however, are not richly typed, so no additional type information is necessary.
A simple example of the data model issue is DOM vs. XQuery. The DOM is relatively simple. Where most every object is a node, some nodes have names, some have values and some have children and siblings. The DOM essentially is a tree with little semantic information, and virtually all of its information is contained in the XML document itself. Conversely, the XQuery data model is typed. XQuery does support simple, well-formed XML; however, it also supports type information, as obtained from a schema-validated document, where the schema information comes from outside the document.
It is possible to choose a storage data model equivalent to the XML infoset or DOM, but then the powerful type facilities of XPath 2.0 and XQuery 1.0 are not fully available. A schema-validated document has type information available at the time it is parsed and validated. A system where parsing, validation and querying occur at the same time has no problem obtaining type information to satisfy the query. However, in a fine-grained storage system, the parsing and query events are not related. This means that at the time of the query, type information must be found if it is to be used for the query. There are several choices for how a system can implement types:
Store type information with each document and typed object and materialize it for querying.
Store references to relevant schema files and reload (parse) them for querying.
Map each type to the nearest atomic type in the XML Schema recommendation and store that information.
Don't support type information at all, which limits queries and forces them to use their own, complex type definitions.
Granularity of addressability is tied closely to the data model. At one extreme is the choice of DOM objects as the addressable unit. This means that each DOM node, be it a document, element or attribute value, is an addressable and separately stored object. Although simple, this approach is quite expensive in terms of memory, disk space and CPU. There are other, coarser-grained solutions. One is to use the element as an addressable unit and associate its attributes and child text nodes. Another is to address elements and text nodes and associate attributes with elements. The former may be better for locality of reference, if an element and its attributes and text nodes are likely to be referenced together.
Native XML databases that store documents as fine-grained nodes must assign addressable node identifiers (node IDs) to addressable units. Node IDs are used to retrieve specific nodes during processing. When it comes to physical storage, size matters. Smaller nodes and node IDs mean better locality of reference and fewer disk accesses to read and write data.
Berkeley DB XML stores nodes in a B-tree, where node IDs are allocated in document order, which also is an iteration order on the B-tree. This means that once a node is located, serialization or child navigation can occur by way of iteration rather than by additional lookup operations.
With the appropriate sorting/comparison function, a node ID that is a B-tree key can take on many physical forms. It can be as simple as an integer, or it can be a complex array or string. Node numbering is one of the more interesting and important design choices in a native XML database. There are node numbering schemes that have the ability to allow insertion and removal of arbitrary nodes without renumbering and to allow query-relevant operations to be performed based solely on node numbers and indexes, eliminating node lookups.
Berkeley DB XML uses a numbering scheme that allows some direct relationship comparisons and attempts to minimize the need to materialize nodes for navigation. The scheme also avoids renumbering when a document is modified partially.
One advantage of fine-grained storage is the ability to modify some parts of documents without touching the rest. There is a significant performance and scalability benefit in such “surgical” changes; however, it can be difficult to do efficiently. Many systems do not support partial modification of documents, and if they do, it is only through a well-defined interface such as XUpdate, as opposed to a direct DOM manipulation.
A partial modification can render a document invalid, or worse, malformed. Re-parsing for validation, however, negates much of the benefit of partial modifications. Insertion or removal of an addressable object, such as an element, affects the system's node numbering scheme, as described above. Indexes also are affected and must be updated. A database may choose to revalidate or parse after a modification or allow the application to request it explicitly.
Fine-grained document storage has a disadvantage in serialization of an entire document. In this situation, an iterator must traverse the addressable pieces of the document. If this is a common operation, it may be worth optimizing or caching the serialized document for reuse, which creates a possible concurrency problem. Document serialization can be optimized by maintaining addressable units in document order, keeping names in stored nodes rather than name IDs and using coarser granularity, which leads to fewer objects retrieved from disk.
Today’s modular x86 servers are compute-centric, designed as a least common denominator to support a wide range of IT workloads. Those generic, virtualized IT workloads have much different resource optimization requirements than hyperscale and cloud applications. They have resulted in a “one size fits all” enterprise IT architecture that is not optimized for a specific set of IT workloads, and especially not emerging hyperscale workloads, such as web applications, big data, and object storage. In this report, you will learn how shifting the focus from traditional compute-centric IT architectures to an innovative disaggregated fabric-based architecture can optimize and scale your data center.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
| Non-Linux FOSS: Seashore | May 10, 2013 |
| Trying to Tame the Tablet | May 08, 2013 |
| Dart: a New Web Programming Experience | May 07, 2013 |
- New Products
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Drupal Is a Framework: Why Everyone Needs to Understand This
- A Topic for Discussion - Open Source Feature-Richness?
- Home, My Backup Data Center
- What's the tweeting protocol?
- New Products
- Readers' Choice Awards
- RSS Feeds
- Dart: a New Web Programming Experience
Enter to Win an Adafruit Prototyping Pi Plate Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Prototyping Pi Plate Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- Next winner announced on 5-21-13!
Free Webinar: Linux Backup and Recovery
Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.
In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.




12 hours 6 min ago
14 hours 38 min ago
15 hours 55 min ago
16 hours 30 min ago
16 hours 53 min ago
21 hours 41 min ago
22 hours 28 min ago
1 day 2 min ago
1 day 1 hour ago
1 day 3 hours ago