Journaling with ReiserFS
There are a few new file systems coming out for Linux these days, bringing some badly needed features for both servers and desktops. I'll briefly describe some of the key ReiserFS features and discuss some details of the journal layer.
ReiserFS stores all file system objects in a single B*tree. The tree supports:
Dynamic inode allocation
Compact, indexed directories
There are four basic types of items in the tree. Stat data, directory items, indirect items and direct items. Items are found by searching for a key, where the key contains an ID, the offset in the object you are looking for and the item type.
The ReiserFS directories grow and shrink as their contents change. A hash of the file name is used to keep an entry's offset in the directory constant. The tree indexing of this hash allows for very large directories without much performance loss and still provides clean support for NFS and the standard directory operations.
For files, indirect items point to data blocks, and direct items contain packed file data. This packed file data is stored directly in the tree and can share space in the tree nodes with items from other objects. So, for large files, ReiserFS stores block pointers similar to the ones ext2 uses, but for small files we pack the data together to prevent wasted space.
All of these items are resizable by rebalancing the tree. We can append to the packed file data, or if we need another field in the stat data, it can grow to accommodate the new information. The disk format deserves much more detail than I'm giving it here, and you can learn more from the papers on the ReiserFS home page (see Resources).
ReiserFS actually has two main disk formats. The new format introduced in our 2.4 code allows 60-bit file offsets, and the format used in our 2.2 code uses 32-bit offsets. When you mount an older file system under the new kernel, the old format is preserved and large files are not allowed.
There is a mount option for converting to the new format, but the code for mounting the new format under 2.2 kernels is still in beta. Instead of putting out-of-date information in this article, I suggest going to the ReiserFS web site for details on enabling large file support.
My goal here isn't to describe APIs or log data structures; I'm hoping to list the major issues involved in file system logging and how ReiserFS deals with them.
Before we talk about how logging works, let's discuss the problem we are trying to solve. In order to have a consistent file system after a crash, updates need to be atomic. They need to happen completely or not at all. For example, to append blocks onto a file, you need to update the file's block pointers, allocate blocks from the free list and update the superblock. If the system crashes in the middle of these changes, the file might have a pointer to a block still on the free list, or the superblock might not have updated stats in it, or the block you allocated could be lost (not in the file and not on the free list).
The ReiserFS journal uses a simple metadata-only, write-ahead logging scheme. The idea is that before any changes are written to disk, they are first committed to a log. After a crash, committed transactions are replayed, which is nothing more than copying blocks from the log into the main disk area.
Writing changes to the log isn't what makes logging complicated. The hard part is keeping the log from slowing your file system down to a crawl. The most obvious optimization is to write to the log in big sequential chunks and lower the number of commit blocks written. Most operations update a small number of blocks, so the journal combines multiple operations into a large atomic unit.
Modified buffers cannot be flushed until they have been copied to the log, and they cannot be freed until they have been flushed. Larger transactions pin more kernel memory but also make many other optimizations possible. Since ReiserFS stores everything in a balanced tree, the tree frequently needs balancing. Tree blocks are allocated, modified and then freed in another balance later on. With larger transactions, we increase the chance the block will be freed before it is written to the log or the main disk.
It is common for blocks to be logged over and over again. If the superblock is included in transactions one, two and three, it needs to be written to the log once for each transaction. But, it doesn't need to be written the main disk area until after transaction three has finished. The total number of writes needed is lower, and most of the writes are to the sequential log. In some cases, this actually makes logging faster than the original file system was.
Whenever possible, log I/O is done by a worker thread, kreiserfsd. This allows log commits to happen in the background, without slowing down user processes. However, the log is a fixed size, so user processes might have to wait for log space to become available before they can start a new transaction. A great deal of care must be taken to make sure processes waiting on the log don't have resources needed by a process already inside a transaction.
Most of the file system does not need to be aware there is a journal layer keeping things safe, but there are a few new rules that need to be followed. First, it isn't safe to modify a dirty buffer. On SMP systems, another CPU might be writing the buffer while you are changing it, which means the modifications would get to disk before the transaction is committed.
Most operations will alter a limited number of buffers, but file writes and truncates are effectively unbounded. Instead of adding the complexity of unbounded transaction size in the journal layer, I chose to code consistency points into these operations. If the current transaction needs to end, they log enough information to make the file system consistent, and then start a new transaction. When data logging is used, fsync needs to do the same checks.
Another new rule required by the journal layer has to do with reusing blocks when metdata-only logging is used. Picture these two transactions:
1. allocate block 200, insert into the tree<\n> change and log block 200 free block 200 close and commit transaction 1 2. allocate block 200 as a data block change block 200, fsync to disk close and commit transaction 2 [system crash]
After the crash, the transactions are replayed in order. While replaying transaction one, the logged version of block 200 is copied into the main disk, and after replaying transaction two, block 200 is a data block in a file. But, the contents written to block 200 by the fsync are no longer there. ReiserFS avoids this by never allocating a data block until there is no chance a log replay will overwrite the contents with old information. When the file system is full, this means we have to flush transactions to disk and find reusable blocks. Similar checks need to happen if a data block is logged and then written directly later on.
Now that fsck isn't needed after every crash, we need to be more careful with lost files. An unlinked file isn't actually deleted until the last open process using it finishes. If the system crashes before the delete operation is complete, the journal will give a consistent file system, but some space will still be allocated to the file. Since the file isn't in the directory tree, there isn't a way to reclaim the blocks.
The easiest way to fix this in ReiserFS is to link the file into a special directory. ReiserFS directories are very fast, and there isn't much locking involved if you aren't worried about file name conflicts. After a crash, the directory is read, and file deletion is finished for any objects left. The special directory doesn't actually need file names at all, just the key information for looking up the file. This fix is not yet integrated into the official ReiserFS releases, but it should be soon.