Journaling with ReiserFS
From time to time, people ask for a version of the transaction API exported to user space. The ReiserFS journal layer was designed to support finite operations that usually complete very quickly, and it would not be a good fit for a general transaction subsystem. It might be a good idea to provide atomic writes to user space, however, giving applications more control over grouping operations together. That way, an application could request that a 64KB file be created in a certain directory and have the whole thing treated as a single atomic operation. Very little planning has happened in this area thus far.
As memory gets low on the system, the kernel needs to start flushing dirty data to disk so the pages can be freed. But, pinned buffers from uncommitted transactions can't be freed until the transaction commits, leaving the VM unable to do anything without help from the file system. We also want to make sure the journal does not use too high a percentage of the system memory for pinned buffers.
We will be working with the VM developers to communicate memory pressure to the file systems properly. The API is not set in stone yet, but people seem to be leaning toward a flush callback associated with each page, along with a generic memory-pressure registration system. It isn't known yet how much of that will happen in the 2.4 kernel and what will be left for 2.5.
LVM adds a bunch of cool new features to Linux, one of which is the ability to make read-only snapshots of a device. The snapshot is created very quickly, and copy-on-write is used to keep the snapshot unchanged as the original device is modified. This allows for on-line, consistent backups of most software on just about any file system.
But, the journaled file systems make this a little harder. When sync is called on ReiserFS, we just commit metadata changes to the log, knowing that a replay will make things consistent after a crash. For a read-only LVM snapshot, log replay is not an option. Instead, we can provide a few new generic calls to flush everything to the main disk and pause new file system modifications. While things are paused, LVM initializes the snapshot, so it will be consistent without a log replay. Once LVM is done, the file system is unlocked and writes proceed normally.
Since all the file system operations need to be able to wait on the log, this was easily coded in ReiserFS. LVM 0.9 and ReiserFS 3.6.18 have this functionality, but we are not sure when the generic calls are going to be added in the kernel. Regardless, patches to provide the missing pieces will be available on the ReiserFS and LVM web sites.
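From user space, a consistent snapshot backup then comes down to a few LVM and standard commands. A minimal sketch, assuming the patched LVM and ReiserFS described above (the volume group, volume names, sizes, and paths here are made-up examples):

```shell
# Create a read-only snapshot of /dev/vg0/home.  With the
# patches described above, the file system is flushed and
# locked while LVM initializes the snapshot, so the snapshot
# is consistent without a log replay.
lvcreate --size 500M --snapshot --name homesnap /dev/vg0/home

# Mount the snapshot read-only and take a consistent backup
# while the original volume stays on-line.
mount -o ro /dev/vg0/homesnap /mnt/snap
tar czf /backup/home.tar.gz -C /mnt/snap .

# Drop the snapshot once the backup is done.
umount /mnt/snap
lvremove -f /dev/vg0/homesnap
```

The original volume keeps taking writes the whole time; only the brief flush-and-pause window while the snapshot is initialized blocks new modifications.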
Another LVM feature is the ability to relocate extents from one device to another. If you discover an area of the disk is getting higher-than-average traffic, you can relocate those blocks to a faster device. In fact, you could relocate the entire log area to a faster device, reducing head contention and drive seeks. Relocating the log to a solid-state disk could really improve performance for log-intensive applications.
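With the LVM tools, moving hot extents is a one-line operation. A sketch, assuming the log happens to occupy physical extents 0-99 of /dev/sda1 and that /dev/sdd1 is the faster device (both device names and the extent range are examples):

```shell
# Move physical extents 0-99 from the busy disk to the
# faster one; the logical volume stays on-line during
# the move.
pvmove /dev/sda1:0-99 /dev/sdd1
```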
In the 2.2 kernels, the software RAID code could write pinned buffers to disk, breaking the write-ordering constraints used to keep things consistent. Only striping and concatenation were completely safe, and mirroring was safe as long as you did not use the on-line rebuild code. In the 2.4 kernels, all software RAID levels should work properly with the journaled file systems.
ReiserFS has problems supporting NFS because 64 bits of information are required to find an object in the tree, and NFS expects to be able to find an inode with just the inode number (32 bits long). The good news is that the NFS file handle has enough room to store the extra information ReiserFS needs to find the file again later, and other kernel developers have written APIs to give the file system control over part of the file handle. By the time this article is out, there should be public patches adding proper NFS support to ReiserFS.
Some newer drives ship with write-back caching enabled by default, which improves their numbers in performance benchmarks. With write-back caching, the drive reports a write as complete before the data actually reaches the media; the block is still in the drive's cache, where writes can be reordered. If that happens, metadata changes might be written before the log commit blocks, leading to corruption if the machine loses power. It is very important to disable write-back caching on both IDE and SCSI drives.
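For IDE drives, hdparm can turn the write cache off; for SCSI drives, the WCE bit in the caching mode page must be cleared with a mode-page editor (sdparm, shown below, is one such tool; the device names are examples):

```shell
# IDE: disable (-W0) and then display the drive's
# write-caching flag.
hdparm -W0 /dev/hda
hdparm -W /dev/hda

# SCSI: clear the WCE (write cache enable) bit in the
# caching mode page.
sdparm --set WCE=0 /dev/sda
```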
Some hardware RAID controllers provide a battery-backed write-back cache that preserves the cache contents if the system loses power. These should be safe to use, but the cache battery should be checked often. A dramatic performance increase can be seen with these write caches, especially for log-intensive applications like mail servers.