Journaling with ReiserFS

Mason gives a tour through the Reiser File System: its features and construction.
User Space Transactions

From time to time, people ask for a version of the transaction API exported to user space. The ReiserFS journal layer was designed to support finite operations that usually complete very quickly, and it would not be a good fit for a general transaction subsystem. It might be a good idea to provide atomic writes to user space, however, and give them more control over grouping operations together. That way an application could request for a 64K file to be created in a certain directory and treat it like an atomic operation. Very little planning has happened in this area thus far.

VM Integration

As memory gets low on the system, the kernel needs to start flushing dirty data to disk so the pages can be freed. But, pinned buffers from uncommitted transactions can't be freed until the transaction commits, leaving the VM unable to do anything without help from the file system. We also want to make sure the journal does not use too high a percentage of the system memory for pinned buffers.

We will be working with the VM developers to give memory pressure to the file systems properly. The API is not set in stone yet, but people seem to be leaning toward a flush callback associated with the page, and a generic memory pressure registration system. It isn't known yet how much of that will happen in the 2.4 kernel and what will be left for 2.5.

ReiserFS and LVM

LVM adds a bunch of cool new features to Linux, one of which is the ability to make read-only snapshots of a device. The snapshot is created very quickly, and copy-on-write is used to keep the snapshot unchanged as the original device is modified. This allows for on-line, consistent backups of most software on just about any file system.

But, the journaled file systems make this a little harder. When sync is called on ReiserFS, we just commit metadata changes to the log, knowing that a replay will make things consistent after a crash. For a read-only LVM snapshot, log replay is not an option. Instead, we can provide a few new generic calls to flush everything to the main disk and pause new file system modifications. While things are paused, LVM initializes the snapshot, so it will be consistent without a log replay. Once LVM is done, the file system is unlocked and writes proceed normally.

Since all the file system operations need to be able to wait on the log, this was easily coded in ReiserFS. LVM 0.9 and ReiserFS 3.6.18 have this functionality, but we are not sure when the generic calls are going to be added in the kernel. Regardless, patches to provide the missing pieces will be available on the ReiserFS and LVM web sites.

Another LVM feature is the ability to relocate extents from one device to another. If you discover an area of the disk is getting higher-than-average traffic, you can relocate those blocks to a faster device. In fact, you could relocate the entire log area to a faster device, reducing head contention and drive seeks. Relocating the log to a solid-state disk could really improve performance on log intensive applications.

Software RAID

In the 2.2 kernels, the software RAID code could write pinned buffers to disk, which breaks the write ordering constraints used to keep things consistent. Only drive striping and concatenation were completely safe, and mirroring was safe as long as you did not use the on-line rebuild code. In the 2.4 kernels, all software RAID levels should work properly with the journaled file systems.

ReiserFS and NFS

ReiserFS has problems supporting NFS because 64 bits of information are required to find an object in the tree, and NFS expects to find an inode with just the inode number (32-bits long). The good news is the NFS file handle has enough room to store the extra information ReiserFS needs in order to find the file again later, and other kernel developers have written APIs to give the file system control over some of the file handle. By the time this article is out, there should be public patches to add proper NFS support to ReiserFS.

Write Caching

For performance benchmarks, some of the new drives have write-back caching by default. This means the drive reports a write is completed before it is actually on the media. The block is still in the drive's cache, where the writes can be reordered. If this happens, metadata changes might be written before the log commit blocks, leading to corruption if the machine loses power. It is very important to disable write-back caching on both IDE and SCSI drives.

Some hardware RAID controllers provide a battery-backed write-back cache that preserves the cache contents if the system loses power. These should be safe to use, but the cache battery should be checked often. A dramatic performance increase can be seen with these write caches, especially for log intensive applications like mail servers.