Trees in the Reiser4 Filesystem, Part I

The basic structure of the new ReiserFS—graphs vs. trees, keys, nodes and blocks.
Choosing Which Subtree

We start our search at the root, because from the root we can reach every other node. But, how do we choose which subtree of the root to go to from the root? The root contains pointers to its subtrees. For each pointer to a subtree there is a corresponding left delimiting key. Pointers to subtrees, and the subtrees themselves, are ordered by their left delimiting key. A subtree pointer's left delimiting key is equal to the least key of the things in the subtree. Its right delimiting key is larger than the largest key in the subtree, and it is the left delimiting key of the next subtree of this node.

Each subtree contains only things whose keys are at least equal to the left delimiting key of its pointer and are not more than its right delimiting key. If there are no duplicate keys in the tree, then each subtree contains only things whose keys are less than its right delimiting key. If there are no duplicate keys, then by looking within a node at its pointers to subtrees and their delimiting keys, we know which subtree of that node contains the thing we want.

Duplicate keys are a topic for another time. For now I will only hint that when searching through objects with duplicate keys we find the first of them in the tree. Then we search through all the duplicates, one by one, until we find what we are looking for. Allowing duplicate keys can allow for smaller keys, so there is sometimes a trade-off between key size and the average frequency of such inefficient linear searches.

The contents of each node in the tree are sorted within the node. Therefore, the entire tree is sorted by key, and for a given key we know exactly where to go to find at least one thing with that key.

Node Size

We choose to make the nodes equal in size, so it is easier to allocate the unused space between nodes, because it will be of a size equal to some multiple of the size of a node. This way, there aren't any problems of space being free but not large enough to store a node. Also, disk drives have an interface that assumes equal size blocks, which they find convenient for their error-correction algorithms.

If having the nodes be equal in size is not very important, perhaps because the tree fits into RAM, then using a class of algorithms called skip lists is worth considering.

Reiser4 nodes are usually equal to the size of a page, which if you use Linux on an Intel CPU is 4,096 (4k) bytes. There is no measured empirical reason to think this size is better than others; it is simply the one that Linux makes easiest and cleanest to program. Quite honestly, we have been too busy to experiment with other sizes.

Sharing Blocks Saves Space

If nodes are of equal size, how do we store large objects? We chop them into pieces, and we call these pieces items. An item is sized to fit within a single node.

Conventional filesystems store files in whole blocks. Roughly speaking, this means that, on average, half a block of space is wasted per file because not all of the last block of the file is used. If a file is much smaller than a block, the space wasted is a large percentage of the total file size. It is not effective to store such typical database objects as addresses and phone numbers in separately named files in a conventional filesystem because more than 90% of the space in the storage blocks is wasted. By putting multiple items within a single node in Reiser4, we are able to pack multiple small pieces of files into one block. Our space efficiency is roughly 94% for small files. This does not count per-item-formatting overhead, whose percentage of total space consumed depends on average item size and, for that reason, is hard to quantify.

Aligning files to 4k boundaries does have advantages for large files though. When a program wants to operate directly on file data without going through system calls to do it, it can use mmap() to make the file data part of the process' directly accessible address space. Due to some implementation details, mmap() needs file data to be 4k-aligned. If the data is already 4k-aligned, it makes mmap() much more efficient. In Reiser4 the current default is that files larger than 16k are 4k-aligned. We don't yet have enough empirical data and experience to know whether 16k is the optimal default value for this cutoff point.

Leaves, Twigs and Branches

Continuing with our tree metaphor, leaves are nodes that have no children. Internal nodes are nodes that have children.

Figure 3. A Height = 4, Four-Level, Fanout = 3, Balanced Tree

A search starts with the root node, traverses two more internal nodes and ends with a leaf node, which is a node that holds the data and has no children.

An item is a data container that is contained entirely within a single node. A node that contains items is called a formatted node. When we store objects in the tree, we put their various parts into items and unformatted leaves. Unformatted leaves (unfleaves) are leaves that contain only data and do not contain any formatting information. Only leaves can contain unformatted data. Pointers are stored in items, and so all internal nodes are necessarily formatted nodes.

Pointers to unfleaves are different in their structure from pointers to formatted nodes. To clarify, extent pointers point to unfleaves. An extent is a sequence of contiguous, in block number order, unfleaves that belong to the same object. An extent pointer contains the starting block number of the extent and a length. Because the extent belongs to only one object, we can store only one key for the extent, and we can calculate the key of any byte within that extent. If the extent is at least two blocks long, extent pointers are more compact than regular node pointers.

Node pointers are pointers to formatted nodes. We do not yet have a compressed version of node pointers, but they should be available soon. Notice that with extent pointers we don't have to store the delimiting key of each node pointed to, while we need to do this when using node pointers. We will probably introduce key compression at the same time we add compressed node pointers. One would expect keys to compress well because they are sorted into ascending order. We expect our node and item plugin infrastructure will make such features easy to add at a later date.

Twigs are parents of leaves, and extent pointers exist only in twigs. (This is a very controversial design decision I will discuss in Part II of this article.) Branches are internal nodes that are not twigs.

You might think we would number the root level 1, but since the tree grows at the top, it turns out to be more useful for number 1 to be the level with the leaves where object data is stored. The height of the tree depends upon how many objects we have to store and what the fanout rate (average number of children) of the internal and twig nodes will be.

Figure 4. A Reiser4 Tree, Including the Internal Nodes Called Twig Nodes

______________________

Webinar
One Click, Universal Protection: Implementing Centralized Security Policies on Linux Systems

As Linux continues to play an ever increasing role in corporate data centers and institutions, ensuring the integrity and protection of these systems must be a priority. With 60% of the world's websites and an increasing share of organization's mission-critical workloads running on Linux, failing to stop malware and other advanced threats on Linux can increasingly impact an organization's reputation and bottom line.

Learn More

Sponsored by Bit9

Webinar
Linux Backup and Recovery Webinar

Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.

In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.

Learn More

Sponsored by Storix