Btrfs on CentOS: Living with Loopback

Introduction

The btrfs filesystem has taunted the Linux community for years, offering a stunning array of features and capability, but never earning universal acclaim. Btrfs is perhaps more deserving of patience, as its promised capabilities dwarf all peers, earning it vocal proponents with great influence. Still, none can deny that btrfs is unfinished; many features are very new, and stability concerns remain for common functions.

Most of the intended goals of btrfs have been met. However, Red Hat famously cut continued btrfs support from their 7.4 release, and has allowed the code to stagnate in their backported kernel since that time. In seeming contradiction, the Fedora project announced their intention to adopt btrfs as the default filesystem for variants of their distribution. SUSE has maintained btrfs support for their own distribution and the greater community for many years.

For users, the most desirable features of btrfs are transparent compression and snapshots; these features are stable, and relatively easy to add as a veneer to stock CentOS (and its peers). Administrators are further compelled by adjustable checksums, scrubs, and the ability to enlarge as well as (surprisingly) shrink filesystem images, while some advanced btrfs topics (e.g. deduplication, RAID, ext4 conversion) aren't really germane for minimal loopback usage. The systemd init package also has dependencies upon btrfs, among them machinectl and systemd-nspawn. Despite these features, there are many usage patterns that are not directly appropriate for use with btrfs. It is hostile to most databases and many other programs with incompatible I/O, and should be approached with some care.

The two most accessible providers of CentOS-compatible btrfs-enabled kernels are the El Repo Mainline, and the Oracle Unbreakable Enterprise Kernel (UEK), but there are significant provisos on support and features with each of these options. Oracle's kernel does not implement the latest standards for btrfs checksums which enforce filesystem integrity, and there are other organizational issues from a CentOS perspective where Oracle has fallen down. The El Repo Mainline has the latest features, but the use of it is discouraged and it is not supported. Current Fedora kernels also appear to work on CentOS 8, but these installations are more invasive in removing stock kernel components. Users will face Hobson's choice depending upon their need for advanced features or commercial support.

Still, with a capable kernel, these features can be easily enabled on any CentOS, Red Hat, or Oracle Linux OS via a loopback mount (at some performance penalty) when running on a default XFS host filesystem. In cases where these features are indispensable, they can prevent a migration to Solaris, FreeBSD, or even SUSE where advanced storage features are more commonplace.

I will make some reference here to my past article on ZFS for Linux, to clarify and translate nomenclature between these two popular filesystems. It is not necessary to understand ZFS to grasp this discussion of btrfs, but the contrast can be helpful.

Installation

The two providers of btrfs-enabled kernels that are most compatible with a CentOS installation offer a very different experience with support and features. The installation of both kernels for evaluation is likely the most convenient, and the Send/Receive section below assumes that their features are both present.

CentOS 7 did support native btrfs as a custom option in the OS installer (as a "technology preview"), but this was removed from the CentOS 8 installer, so it won't be covered here. CentOS 8 is used for all examples presented; CentOS 7 users should likely prefer the UEK.

To install the Oracle UEK, add the following file as /etc/yum.repos.d/uek-olx.repo (for CentOS 7, change the "ol8/OL8" to "ol7/OL7"):

[ol8_UEKR6]
name=Latest Unbreakable Enterprise Kernel Release 6 for Oracle Linux $releasever ($basearch)
baseurl=https://yum.oracle.com/repo/OracleLinux/OL8/UEKR6/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1
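
Because the CentOS 7 and CentOS 8 variants of this file differ only in the release number, the stanza can also be generated rather than edited by hand. What follows is a minimal sketch and not Oracle's own procedure; the function name uek_repo and the @V@ placeholder are inventions for illustration, and the emitted text is simply the stanza above with the version substituted.

```shell
# Print the UEKR6 repo stanza for Oracle Linux major version 7 or 8.
# uek_repo and the @V@ placeholder are hypothetical names for illustration.
uek_repo() {
    ver=${1:-8}
    case "$ver" in
        7|8) ;;
        *) echo "uek_repo: expected 7 or 8, got '$ver'" >&2; return 1 ;;
    esac
    # The quoted heredoc keeps $releasever and $basearch literal for yum/dnf.
    sed "s/@V@/$ver/g" <<'EOF'
[ol@V@_UEKR6]
name=Latest Unbreakable Enterprise Kernel Release 6 for Oracle Linux $releasever ($basearch)
baseurl=https://yum.oracle.com/repo/OracleLinux/OL@V@/UEKR6/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1
EOF
}

# Review the CentOS 7 variant on the terminal before installing it.
uek_repo 7
```

To install the result, redirect the output as root, for example into /etc/yum.repos.d/uek-olx.repo.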

Load the GPG key for the relevant repo, as (somewhat incorrectly) described in Oracle's instructions:

 curl -o /etc/pki/rpm-gpg/RPM-GPG-KEY-oracle \
	https://yum.oracle.com/RPM-GPG-KEY-oracle-ol8
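
A downloaded key can be inspected before dnf is asked to trust it. The sketch below compares the key file's fingerprint with the one recorded in the dnf session later in this article; ideally the expected value would come from an independent channel. The helper names are hypothetical, and the use of gpg --show-keys assumes GnuPG 2.2.8 or later.

```shell
# Normalize a fingerprint: drop whitespace, fold to upper case.
fpr_norm() {
    printf '%s' "$1" | tr -d '[:space:]' | tr '[:lower:]' '[:upper:]'
}

# Succeed only when two fingerprints match after normalization.
fpr_match() {
    [ "$(fpr_norm "$1")" = "$(fpr_norm "$2")" ]
}

# Print the first fingerprint found in a key file, without importing it.
# Assumption: GnuPG 2.2.8+ (--show-keys); field 10 of an "fpr" record
# in --with-colons output holds the fingerprint.
key_fpr() {
    gpg --show-keys --with-colons "$1" |
        awk -F: '$1 == "fpr" { print $10; exit }'
}

# Fingerprint as displayed by dnf in the install session in this article.
EXPECTED='76FD 3DB1 3AB6 7410 B89D B10E 8256 2EA9 AD98 6DA3'

# Usage, once the key file has been downloaded:
#   fpr_match "$(key_fpr /etc/pki/rpm-gpg/RPM-GPG-KEY-oracle)" "$EXPECTED" \
#       && echo "fingerprint matches" || echo "MISMATCH"
```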

For an automated install of the UEK, execute the following (leave out the devel package if you don't have a C compiler):

yum install kernel-uek btrfs-progs btrfs-progs-devel

I have added the options --disablerepo=AppStream, --disablerepo=BaseOS, and --disablerepo=extras to coerce dnf to work through a restrictive firewall, pulling only from Oracle's repository:

dnf	--disablerepo=AppStream \
	--disablerepo=BaseOS \
	--disablerepo=extras \
	install kernel-uek btrfs-progs btrfs-progs-devel

The results of this command are below:

Last metadata expiration check: 0:04:38 ago on Tue 15 Sep 2020 11:43:34 AM CDT.
Dependencies resolved.
================================================================================
 Package           Arch   Version                               Repo       Size
================================================================================
Installing:
 btrfs-progs       x86_64 5.4.0-1.el8                           ol8_UEKR6 869 k
 btrfs-progs-devel x86_64 5.4.0-1.el8                           ol8_UEKR6  52 k
 kernel-uek        x86_64 5.4.17-2011.6.2.el8uek                ol8_UEKR6  60 M
Upgrading:
 linux-firmware    noarch 999:20200124-999.4.git1eb2408c.el8    ol8_UEKR6 100 M

Transaction Summary
================================================================================
Install  3 Packages
Upgrade  1 Package

Total download size: 161 M
Is this ok [y/N]: y
Downloading Packages:
(1/4): btrfs-progs-devel-5.4.0-1.el8.x86_64.rpm  21 kB/s |  52 kB     00:02    
(2/4): btrfs-progs-5.4.0-1.el8.x86_64.rpm       225 kB/s | 869 kB     00:03    
(3/4): kernel-uek-5.4.17-2011.6.2.el8uek.x86_64 2.1 MB/s |  60 MB     00:29    
(4/4): linux-firmware-20200124-999.4.git1eb2408 1.1 MB/s | 100 MB     01:27    
--------------------------------------------------------------------------------
Total                                           1.8 MB/s | 161 MB     01:30
Latest Unbreakable Enterprise Kernel Release 6  3.0 MB/s | 3.1 kB     00:00    
Importing GPG key 0xAD986DA3:
 Userid     : "Oracle OSS group (Open Source Software group) "
 Fingerprint: 76FD 3DB1 3AB6 7410 B89D B10E 8256 2EA9 AD98 6DA3
 From       : /etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
Is this ok [y/N]: y
Key imported successfully
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                        1/1 
  Upgrading        : linux-firmware-999:20200124-999.4.git1eb2408c.el8.no   1/5 
  Installing       : btrfs-progs-5.4.0-1.el8.x86_64                         2/5 
  Installing       : btrfs-progs-devel-5.4.0-1.el8.x86_64                   3/5 
  Running scriptlet: kernel-uek-5.4.17-2011.6.2.el8uek.x86_64               4/5 
  Installing       : kernel-uek-5.4.17-2011.6.2.el8uek.x86_64               4/5 
  Running scriptlet: kernel-uek-5.4.17-2011.6.2.el8uek.x86_64               4/5 
  Cleanup          : linux-firmware-20191202-97.gite8a0f4c9.el8.noarch      5/5 
  Running scriptlet: kernel-uek-5.4.17-2011.6.2.el8uek.x86_64               5/5 
  Running scriptlet: linux-firmware-20191202-97.gite8a0f4c9.el8.noarch      5/5 
  Verifying        : btrfs-progs-5.4.0-1.el8.x86_64                         1/5 
  Verifying        : btrfs-progs-devel-5.4.0-1.el8.x86_64                   2/5 
  Verifying        : kernel-uek-5.4.17-2011.6.2.el8uek.x86_64               3/5 
  Verifying        : linux-firmware-999:20200124-999.4.git1eb2408c.el8.no   4/5 
  Verifying        : linux-firmware-20191202-97.gite8a0f4c9.el8.noarch      5/5 
Installed products updated.

Upgraded:
  linux-firmware-999:20200124-999.4.git1eb2408c.el8.noarch                      

Installed:
  btrfs-progs-5.4.0-1.el8.x86_64           btrfs-progs-devel-5.4.0-1.el8.x86_64
  kernel-uek-5.4.17-2011.6.2.el8uek.x86_64

Complete!

For a manual install, pull the latest UEK and associated RPMs from the repository directly:

https://yum.oracle.com/repo/OracleLinux/OL8/UEKR6/x86_64/

After installation, the UEK will configure itself as the default boot kernel. Notice also that a new firmware package is installed in the dnf session above (be prepared to downgrade it back to the CentOS version if the UEK is uninstalled).

There are two major problems with the UEK, both in general and from the CentOS perspective.

First, the Oracle UEKR6 is (currently) too old to use the latest checksum features of btrfs (explained in the next section).

Second, paid support is available for the UEK on CentOS, but only after a conversion of the entire system to Oracle Linux. Loading the UEK does not trigger this conversion. Furthermore, the conversion process appears broken for CentOS 8. When attempting to run the centos2ol.sh converter script, it halts with an error that python2 is required. After installing python2 from AppStream, the script fails with the message: "You appear to be running an unsupported distribution. For assistance, please email <ksplice-support_ww@oracle.com>." Examining the script, only CentOS versions 5, 6, and 7 are allowed, and the lack of CentOS 8 support is also hinted on Oracle's website ("centos2ol.sh can convert your CentOS 6 and 7 systems to Oracle Linux"). As the CentOS 8 platform has been available for over a year, Oracle's script appears to be badly out of date. It is questionable if Oracle supports the CentOS 8 platform at all.

The El Repo project previously maintained a historic archive of the final Red Hat backported btrfs source. While this idled version was never released from testing and has been removed, the latest btrfs kernel modules are available elsewhere in their packages.

El Repo refers to their Mainline as the "kernel of last resort" which is usually a developer tool for backporting hardware drivers. It happens to contain btrfs modules with the latest features, which will work perfectly with all the functionality presented here. To load it, obtain and install the following files from the El Repo Mainline Repository (or install the entry for the yum repository itself):

rpm -Uvh \
	kernel-ml-5.8.5-1.el8.elrepo.x86_64.rpm \
	kernel-ml-core-5.8.5-1.el8.elrepo.x86_64.rpm \
	kernel-ml-modules-5.8.5-1.el8.elrepo.x86_64.rpm

Installation should proceed with the following output:

Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]
Updating / installing...
   1:kernel-ml-core-5.8.5-1.el8.elrepo################################# [ 33%]
   2:kernel-ml-modules-5.8.5-1.el8.elr################################# [ 67%]
   3:kernel-ml-5.8.5-1.el8.elrepo     ################################# [100%]

When complete, reboot, and a new Red Hat kernel (mentioning Ootpa) should appear in the Grub menu. El Repo Mainline kernel users who opt to omit the UEK should likely load Oracle's btrfs-progs, as this will allow userspace maintenance.

The Fedora kernel listed below appears to be functional on CentOS, but Fedora components should be installed with much greater care, as they can remove stock kernel packages that are provided by CentOS (both Mainline and the UEK leave the stock kernel intact) when installed in upgrade mode. The use of a Fedora yum repository is likely not safe in this context.

rpm -Uvh \
 kernel-5.9.0-0.rc8.20201007git7575fdda569b.30.fc34.x86_64.rpm \
 kernel-core-5.9.0-0.rc8.20201007git7575fdda569b.30.fc34.x86_64.rpm \
 kernel-modules-5.9.0-0.rc8.20201007git7575fdda569b.30.fc34.x86_64.rpm

rpm -qa | grep ^kernel | sort

Installation should proceed with the following output:

Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]
Updating / installing...
   1:kernel-core-5.9.0-0.rc8.20201007g################################# [ 17%]
   2:kernel-modules-5.9.0-0.rc8.202010################################# [ 33%]
   3:kernel-5.9.0-0.rc8.20201007git757################################# [ 50%]
Cleaning up / removing...
   4:kernel-4.18.0-193.el8            ################################# [ 67%]
   5:kernel-modules-4.18.0-193.el8    ################################# [ 83%]
   6:kernel-core-4.18.0-193.el8       ################################# [100%]

kernel-5.9.0-0.rc8.20201007git7575fdda569b.30.fc34.x86_64
kernel-core-5.9.0-0.rc8.20201007git7575fdda569b.30.fc34.x86_64
kernel-ml-5.8.5-1.el8.elrepo.x86_64
kernel-ml-core-5.8.5-1.el8.elrepo.x86_64
kernel-ml-modules-5.8.5-1.el8.elrepo.x86_64
kernel-modules-5.9.0-0.rc8.20201007git7575fdda569b.30.fc34.x86_64
kernel-tools-4.18.0-193.el8.x86_64
kernel-tools-libs-4.18.0-193.el8.x86_64
kernel-uek-5.4.17-2011.6.2.el8uek.x86_64

A btrfs-progs package can also be found in Fedora, and it likewise wipes Oracle's packages if they are present.

rpm -Uvh \
 btrfs-progs-5.7-4.fc33.x86_64.rpm \
 btrfs-progs-devel-5.7-4.fc33.x86_64.rpm \
 libbtrfs-5.7-4.fc33.x86_64.rpm \
 libbtrfsutil-5.7-4.fc33.x86_64.rpm

Installation should proceed with the following output:

Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]
Updating / installing...
   1:libbtrfsutil-5.7-4.fc33          ################################# [ 17%]
   2:libbtrfs-5.7-4.fc33              ################################# [ 33%]
   3:btrfs-progs-5.7-4.fc33           ################################# [ 50%]
   4:btrfs-progs-devel-5.7-4.fc33     ################################# [ 67%]
Cleaning up / removing...
   5:btrfs-progs-devel-5.4.0-1.el8    ################################# [ 83%]
   6:btrfs-progs-5.4.0-1.el8          ################################# [100%]

The problem with the El Repo Mainline is the lack of support, edging on active discouragement of its use. Fedora is likely an "even more last resort" as it can remove CentOS kernel packages unless installed with great care. The latest SUSE kernel might also be of interest, but this is not intended for CentOS, and has not been attempted here. Building a custom kernel might provide the best results, at the cost of any and all support.

Btrfs Creation and Checksums

Inherent in the creation of a btrfs filesystem is the choice of block-level checksums, used to record and enforce integrity and accuracy of all content. In recent history, this choice was limited to the CRC32C algorithm, which is prone to collisions and is not reasonably appropriate for deduplication (which is not covered here).

The choice of checksums is constrained by the two kernels of interest, and the options are stark.

Btrfs has recently implemented new checksums, details of which are described by man 5 btrfs, excerpted here for a "3.5GHz intel CPU" (as stated on the manual page):

Digest      Cycles/4KiB   Ratio
CRC32C             1700    1.00
XXHASH             2500    1.44
SHA256           105000   61
BLAKE2b           22000   13

The hashes above might be familiar to technical users of cryptographic applications. The NSA's SHA256 algorithm often has support with CPU primitives for fast processing, and is sufficiently collision-resistant for deduplication. The XXHASH algorithm asserts great improvement over CRC32C in avoidance of hash collisions with minimal impact to throughput.

The problem with the above hash choices is laid plain in the manual pages: "To mount such [a] filesystem [the] kernel must support the checksums as well." It must be understood that the current Oracle UEKR6 only supports CRC32C, and will not mount a btrfs filesystem created with any other hash function, even though the userspace tools it distributes both allow and encourage it.

For the rest of this document, the following mount point will be assumed for our main btrfs filesystem. Please create this mount point to follow all further examples:

mkdir /bvault

The following shell fragment demonstrates userspace options versus kernel checksum limitations:

for CSUM in crc32c xxhash sha256 blake2
do fallocate -l 50G /home/CALDRON.BTRFS

   mkfs.btrfs --csum="$CSUM" /home/CALDRON.BTRFS

   mount -o loop /home/CALDRON.BTRFS /bvault
   umount /bvault

   rm -v /home/CALDRON.BTRFS
done

I choose the name "caldron" above in recognition of the ZFS zpool "tank" as a name of greater clarity for the brew of features that we are concocting, and I will occasionally refer to this as the "backing store." On the Oracle UEK, only the CRC32C mount attempt will succeed, while the El Repo Mainline will run the script with all checksum types without error.

The use of fallocate above was inspired by a previous article on btrfs loopback devices that seeded this discussion. Loopback mounts are commonly used on .ISO images historically written to optical media. Note that loopback filesystem mounts are generally fast, but can suffer when fsync() calls are excessive.

The Oracle UEK is a recent 5.4.17 kernel, but not recent enough to support anything beyond CRC32C, as is gleaned from man 5 btrfs: "Since kernel 5.5 there are three more [checksums] with different characteristics and trade-offs regarding speed and strength." When even Oracle neglects to backport new features, we come to see Red Hat's point.
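
The 5.5 threshold quoted from the manual page can be tested before a filesystem is created. This is a rough sketch that only parses the kernel release string; the function names are hypothetical, and a vendor kernel that backported the checksums under an older version number would be misjudged by it.

```shell
# Succeed if a kernel release string is 5.5 or newer, the version named by
# man 5 btrfs for the additional checksums. Defaults to the running kernel.
# Caveat: this looks only at the version number, not at actual features.
kernel_has_new_csums() {
    rel=${1:-$(uname -r)}
    case "$rel" in
        *.*) ;;
        *) return 2 ;;                 # no dot: not a release string
    esac
    major=${rel%%.*}                   # text before the first dot
    rest=${rel#*.}                     # text after the first dot
    minor=${rest%%[!0-9]*}             # leading digits of the remainder
    case "$major" in ''|*[!0-9]*) return 2 ;; esac
    case "$minor" in '') return 2 ;; esac
    [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 5 ]; }
}

# Suggest a checksum for mkfs.btrfs --csum on the given (or running) kernel.
pick_csum() {
    if kernel_has_new_csums "$1"; then echo sha256; else echo crc32c; fi
}

# Kernel releases that appear in this article:
for r in 4.18.0-193.el8 5.4.17-2011.6.2.el8uek 5.8.5-1.el8.elrepo
do printf '%s -> %s\n' "$r" "$(pick_csum "$r")"
done
```

The result could then feed the creation step, as in mkfs.btrfs --csum="$(pick_csum)" /home/CALDRON.BTRFS, bearing in mind that an image created under one kernel must also be mountable under any other kernel that will be booted.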

When running the UEK, this is the best checksum available on a new btrfs filesystem:

fallocate -l 50G /home/CALDRON.BTRFS
mkfs.btrfs --csum=crc32c /home/CALDRON.BTRFS
mount -o loop /home/CALDRON.BTRFS /bvault

If the El Repo Mainline is the active kernel, then any checksum implemented by Oracle's mkfs.btrfs may be used.

From a ZFS perspective, this is all very primitive. ZFS allows the dynamic selection of any checksum/hash function, to be applied to any filesystem object. In any case, SHA256 is likely the preferred choice for those desiring extreme data integrity, and willing to forgo support.

Transparent Compression

Three compression types can be applied to directory or file objects within a btrfs filesystem. The available types are, in preference: zstd, lzo, zlib, and none. The assignment of these compression attributes is maintained as btrfs "metadata," and the kernel will perform the file compression as content is written.

Some background and specifics on the btrfs compression settings:

  • zstd - Code contributed by Facebook; it allows a numerical selector between 1 and 15, controlling the compression level applied by the kernel.

  • lzo - Focuses on performance, does not allow tunable options.

  • zlib - Uses the conventional gzip algorithm, and allows a factor to be applied between 1 and 9.

The default compression level, for any tunable algorithm when not specified, is 3.

While compression can be globally set as a mount option for the whole of the mounted filesystem, it can also be applied to specific directories or files. The syntax to assert this property is as follows:

mkdir /bvault/tmp

btrfs property set /bvault/tmp compression zstd

mkdir /bvault/log

btrfs property set /bvault/log compression lzo
btrfs property set /bvault/log compression zlib:9
btrfs property set /bvault/log compression zstd:11

Current properties on a filesystem object can be examined with this syntax:

btrfs property get /bvault/log compression

The result of this command is below:

compression=zstd:11
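
For the filesystem-wide form, compression is given as a mount option instead of a property. The sketch below prints an fstab entry for the loopback image used in this article; the helper name is hypothetical, and noatime is included only because it is recommended in the Snapshots section. A mount-wide setting of this kind affects data written after mounting, not files already on disk.

```shell
# Print an fstab entry for a loop-mounted btrfs image with compression.
# Arguments: image file, mount point, algorithm, optional level.
# fstab_btrfs_loop is a hypothetical helper name.
fstab_btrfs_loop() {
    img=$1 mnt=$2 alg=$3 lvl=${4:-}
    case "$alg" in
        zstd|zlib) opt="compress=$alg${lvl:+:$lvl}" ;;
        lzo)       opt="compress=lzo" ;;          # lzo takes no level
        *) echo "fstab_btrfs_loop: unknown algorithm '$alg'" >&2; return 1 ;;
    esac
    printf '%s %s btrfs loop,noatime,%s 0 0\n' "$img" "$mnt" "$opt"
}

# Entry for the caldron with zstd at the default level of 3.
fstab_btrfs_loop /home/CALDRON.BTRFS /bvault zstd 3
```

The same option string works on the command line, as in mount -o loop,compress=zstd:3 /home/CALDRON.BTRFS /bvault.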

The metadata containing the compression settings on file and directory objects cannot be backed up with tar or other utilities that are not aware of the btrfs internals for this special status. The Send/Receive section below is able to replicate all metadata to a new btrfs filesystem, and is the only file movement utility that captures these hidden settings.

Unfortunately, no reporting tools are present within the btrfs-progs package to discern compression ratios of files on disk. A brute force method is available, lacking granular reporting, using df in a simple shell function to report only desired mounts.

The following shell function will be used in later examples; please make note of it.

function ddf { typeset a b IFS=\|; df | while read a
  do [[ -z "$b" ]] && printf '%s\n' "$a"; for b; do
  case "$a" in *"$b"*) printf '%s\n' "$a";; esac; done; done }

Informal testing with multiple btrfs filesystems was performed, and reported:

ddf test

For a large copy of binary data, the following compression results emerged:

Filesystem 1K-blocks    Used Available Use% Mounted on
/dev/loop0 52428800  1989740  50009460   4% /test1
/dev/loop1 52428800 19106016  32844480  37% /test2
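
The two "Used" figures can be turned into a single ratio. This small sketch assumes that both filesystems received the same data and differ only in their compression settings, which is how I read the test above; the function name is hypothetical.

```shell
# Ratio of two "Used" figures from df, to one decimal place.
# Arguments: the larger figure, then the smaller figure (1K blocks).
used_ratio() {
    awk -v big="$1" -v small="$2" 'BEGIN {
        if (small <= 0) exit 1          # refuse to divide by zero
        printf "%.1f\n", big / small
    }'
}

# Figures from the df output above: /test2 used, then /test1 used.
used_ratio 19106016 1989740
```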

A more granular utility is available as C source. "[The] compsize [program] takes a list of files on a btrfs filesystem and measures used compression types and effective compression ratio. There is a patch adding support for that; currently it's not merged. You can kind of guess at its compressed size by comparing the output from the df command before and after writing a file, if this is available to you."

To make use of this granular reporting utility, assuming that you have access to a Linux C compiler and are able to prepare compsize, ensure that you have installed the following package:

yum install btrfs-progs-devel

After the btrfs source installation, download the following for C compilation:

https://raw.githubusercontent.com/kilobyte/compsize/master/compsize.c
https://raw.githubusercontent.com/kilobyte/compsize/master/radix-tree.c
https://raw.githubusercontent.com/kilobyte/compsize/master/endianness.h
https://raw.githubusercontent.com/kilobyte/compsize/master/kerncompat.h
https://raw.githubusercontent.com/kilobyte/compsize/master/radix-tree.h
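
The downloads can be scripted. The following is a sketch that assumes curl is installed and that the files remain at the locations listed above; the variable and function names are hypothetical.

```shell
# Base URL and file names are taken from the list above.
COMPSIZE_BASE=https://raw.githubusercontent.com/kilobyte/compsize/master
COMPSIZE_FILES='compsize.c radix-tree.c endianness.h kerncompat.h radix-tree.h'

# Print one download URL per line.
compsize_urls() {
    for f in $COMPSIZE_FILES
    do printf '%s/%s\n' "$COMPSIZE_BASE" "$f"
    done
}

# Download each file into the current directory; stop at the first failure.
# Not called here, since it needs network access.
fetch_compsize() {
    compsize_urls | while read -r url
    do curl -fsSLO "$url" || exit 1
    done
}
```

Run fetch_compsize in an empty working directory, then proceed to the compile step.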

Compile these files into a native binary with the following command:

cc -Wall -std=gnu90 -I/usr/include/btrfs \
	-g -o compsize compsize.c radix-tree.c

Test the program. Notice the sync event below - compsize may fail without it, as a sync event appears to prompt kernel compression:

cp /var/log/messages /var/log/secure /bvault/log
sync
./compsize /bvault/log/*

The results of these commands are below:

Processed 2 files, 15 regular extents (15 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL       17%      316K         1.7M         1.7M       
zstd        17%      316K         1.7M         1.7M

To read the documentation on this compression ratio reporting program, download the manual page:

https://raw.githubusercontent.com/kilobyte/compsize/master/compsize.8

Read the manual page with this command:

man ./compsize.8

Snapshots

Administrators of classic UNIX systems often face requests for recovery from backups, and often these requests cannot be met due to cron schedules of tar that fall outside of and do not capture critical activities. Once-a-day backups do not cover temporary data that is delivered, processed, and deleted between the triggered backup cycles.

Snapshots are "instant photographs" of file system state that are created very quickly. As the filesystem changes blocks, a snapshot retains the old content, preserving it exactly as it appeared. Many snapshots can be taken of a btrfs filesystem and retained as long as disk space is available.

Let's take a snapshot:

cp /etc/passwd /etc/group /etc/shadow /bvault/tmp
btrfs subvolume snapshot /bvault \
	/bvault/snapshot-"$(date +%Y%m%d%H%M%S)"

Now, let's simulate the loss of critical content, and recovery from the snapshot:

rm /bvault/tmp/shadow
ls /bvault/snapshot-*/tmp/shadow

We see that the snapshot has retained our critical file:

/bvault/snapshot-20200831112752/tmp/shadow

It's that easy.

The default snapshots created in btrfs are not read-only objects; content can be added or changed within them:

cp /etc/hosts /bvault/snapshot-*/tmp
ls /bvault/snapshot-*/tmp/hosts

The snapshot has now diverged with this new file:

/bvault/snapshot-20200831112752/tmp/hosts

Read-only snapshots can be created with the -r option. This is likely preferable for backups, and read-only status is required for Send/Receive functions, described below.
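
A read-only variant of the earlier snapshot command might be wrapped as follows. This is a sketch; the function name and the DRY_RUN convention are hypothetical, and the snapshot naming simply follows the pattern used above.

```shell
# Take a read-only, timestamped snapshot of a btrfs subvolume.
# Arguments: subvolume path, optional timestamp (defaults to now).
# With DRY_RUN=1 the command is printed instead of executed.
snap_ro() {
    src=$1
    stamp=${2:-$(date +%Y%m%d%H%M%S)}
    set -- btrfs subvolume snapshot -r "$src" "$src/snapshot-$stamp"
    if [ "${DRY_RUN:-0}" = 1 ]
    then printf '%s\n' "$*"
    else "$@"
    fi
}

# Show what would be run against the vault, without touching it.
DRY_RUN=1 snap_ro /bvault
```

Called as root without DRY_RUN, snap_ro /bvault creates the snapshot; such a call could be placed in cron to cover activity between nightly backups.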

There is some discussion in the manual pages of the noatime mount option, relating specifically to snapshots, and a URL is mentioned as a resource for extended discussion. The noatime option has long been known to boost filesystem performance on most UNIX systems, and it takes on additional meaning with btrfs in preventing snapshot growth, playing a further role with critical limits in btrfs, described below.

Snapshots are special cases of mountable btrfs filesystems; they are "subvolumes" as described below, and they are deleted the same way as all subvolumes:

btrfs subvolume delete /bvault/snapshot-*/

A response with an important proviso is issued:

Delete subvolume (no-commit): '/bvault/snapshot-20200831112752'

If a crash occurs after a non-committed subvolume deletion, it may reappear after reboot. To force a commit on the deletion, use either the -c or -C options (the difference is explained in the manual pages).
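
When snapshots accumulate, the timestamped names make it simple to choose which to discard. The sketch below only selects candidates; the function name is hypothetical, and the retention count is a matter of local policy.

```shell
# Read snapshot paths on stdin and print all but the newest $1 of them.
# Relies on the timestamped names sorting in chronological order.
snap_expired() {
    sort | awk -v keep="$1" '
        { name[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print name[i] }'
}

# Usage, as root: delete all but the seven newest snapshots, using -c
# so that each deletion is committed before the command returns.
#   ls -d /bvault/snapshot-* | snap_expired 7 | while read -r s
#   do btrfs subvolume delete -c "$s"
#   done
```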

In ZFS, a snapshot is by default read-only, and a further step must be taken to instantiate a snapshot into a read-write "clone." It is unfortunate that btrfs did not reuse this nomenclature for these features.

Resize

Btrfs is reasonably good at growing, and is one of the few common filesystems that is capable of shrinking. In the case of loopback mounts, the file serving as the "backing store" must first be extended before btrfs will recognize additional space, and (unfortunately) the added space will only be recognized after un/remounting. In shrinking, the filesystem is reduced first, then the backing store is truncated. Counter-intuitively, the truncate utility is the fastest tool for growing or shrinking the backing store.

To add ten gigabytes to the caldron, use the following command:

truncate -s +10G /home/CALDRON.BTRFS

Btrfs will not immediately recognize the new space online, but will after unmounting (note that a remount flag is insufficient to prod btrfs to see the new space):

umount /bvault
mount -o loop /home/CALDRON.BTRFS /bvault
btrfs filesystem resize max /bvault
ddf bvault

The added space appears:

Filesystem 1K-blocks Used Available Use% Mounted on
/dev/loop0  62914560 3652  62372796   1% /bvault

The resize max above is far more convenient than historic volume managers and file systems that require specific numerical sizes at all times (so long, HP-UX VxFS).

Btrfs can also contract, and it is capable of moving content (filesystems, snapshots, metadata) out of the way before releasing the requested space. As the manual pages note, "...shrinking could take a long time if there are data in the device area that’s beyond the new end. Relocation of the data takes time."

Let's retract the recently added space:

btrfs filesystem resize -10G /bvault
umount /bvault
truncate -s -10G /home/CALDRON.BTRFS

Needless to say, a backup should be taken before attempting any resize event, especially so for a reduction in size, where the numbers must agree between these utilities down to the byte.
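
One way to keep the numbers in agreement is to compute the target size once, in bytes, and hand the same value to both tools. This is a sketch with a hypothetical function name; it assumes the filesystem currently spans the whole image, as it does after a resize max.

```shell
# Compute the size in bytes of an image after removing a number of GiB.
# Arguments: current size in bytes, GiB to remove.
shrink_target() {
    cur=$1 gib=$2
    new=$(( cur - gib * 1024 * 1024 * 1024 ))
    [ "$new" -gt 0 ] || { echo "shrink_target: result not positive" >&2; return 1; }
    printf '%s\n' "$new"
}

# A 60G image reduced by 10G:
shrink_target 64424509440 10

# Usage, as root, feeding one number to both tools:
#   new=$(shrink_target "$(stat -c %s /home/CALDRON.BTRFS)" 10)
#   btrfs filesystem resize "$new" /bvault
#   umount /bvault
#   truncate -s "$new" /home/CALDRON.BTRFS
```

Both btrfs filesystem resize and truncate accept a plain byte count as an absolute size, so no unit suffixes need to be reconciled.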

Let's confirm that the vault survived this operation:

btrfs check /home/CALDRON.BTRFS

The check returns extensive status:

Opening filesystem to check...
Checking filesystem on /home/CALDRON.BTRFS
UUID: 5d5b0ada-0ad0-422c-8d91-2cfcc1dee1eb
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 200704 bytes used, no error found
total csum bytes: 4
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 122345
file data blocks allocated: 69632
 referenced 69632

And all's well...

mount -o loop /home/CALDRON.BTRFS /bvault
ddf bvault

...that ends well:

Filesystem 1K-blocks Used Available Use% Mounted on
/dev/loop0  52428800 3652  51887036   1% /bvault

Critical Limits

In the classic FFS of the past days of UNIX, a filesystem administrator was concerned both with space, and with inodes; running out of either brought things to a halt.

Btrfs introduces a new set of worries and behaviors. Data, metadata, and system are all independent entities that can trip an active btrfs filesystem into read-only mode when they are exhausted. Information on these aspects can be found in the btrfs-filesystem manual page, and the command to report them is as follows:

btrfs fi df /bvault

Example sizes for an empty filesystem might be:

Data, single: total=1.01GiB, used=25.36MiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=256.00MiB, used=2.11MiB
GlobalReserve, single: total=3.25MiB, used=0.00B

Historical UNIX variants reserved a certain quantity of disk space for root that was unavailable for other users, preventing a storage crisis from rendering a system unusable. Unfortunately, btrfs lacks such protections, and falling into read-only mode is often unexpected, as the filesystem data might be well within bounds.
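
Since exhaustion arrives without much warning, the report can be reduced to percentages and watched. The sketch below expects the raw byte output of btrfs filesystem df -b on standard input; the function name is hypothetical, and the 90 percent default threshold is an arbitrary illustration rather than any btrfs rule. Note that "total" here is the space allocated to each type, not the size of the device.

```shell
# Read `btrfs filesystem df -b` output on stdin and print each type with
# its used percentage, flagging any at or above the threshold ($1).
bg_usage() {
    awk -F'[,:= ]+' -v limit="${1:-90}" '
        $3 == "total" && $4 > 0 {
            pct = 100 * $6 / $4
            printf "%s %.1f%s\n", $1, pct, (pct >= limit ? " WARN" : "")
        }'
}

# Usage, as root:  btrfs filesystem df -b /bvault | bg_usage 80
# Demonstration with made-up byte counts in the same layout:
printf '%s\n' \
    'Data, single: total=1000, used=250' \
    'Metadata, DUP: total=200, used=190' | bg_usage
```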

In the previous discussion of the noatime mount option above, the URL mentioned in the btrfs manual pages includes comments on ZFS zpools that only allow 63/64ths of space to be consumed, emulating the historical space reservation. There are further warnings that atime writes on btrfs may fail with out-of-space errors, preventing reads from a btrfs filesystem in a read-only state. The first impulse in such a case would be to remount with noatime.

Extending space is usually the answer to regain control of read-only btrfs prior to cleanup. For loopback mounts, the resizing tools above should be sufficient for easy repair, assuming the host XFS filesystem has available space. When btrfs is native on a disk, it can be wise to shrink it slightly on its partition, allowing a resize max for emergency maintenance.

Alternately, quotas can be introduced, but they come at a cost. As the manual page for btrfs-quota warns, "when quotas are activated, they affect all extent processing, which takes a performance hit." If performance loss is acceptable, then there are methods to use quotas to enforce filesystem safety.

ZFS also does not behave well when overfull, which can cause irreparable damage; btrfs seems somewhat more capable in this regard, but still traps the unwary. ZFS also allows the copies= parameter to duplicate stored data, and applies copies=2 to metadata by default; the DUP tags above show that mkfs.btrfs has done the same, which it always does on rotational media, but not on SSDs.

Subvolumes

Red Hat/Sistina of the past decades brought us a volume manager, similar to that used by HP-UX, that allowed us to transcend classic MS-DOS partitions and span physical disks with a single filesystem. ZFS blurred the volume manager into a "pool," and dissolved filesystems (datasets) into the pool, dispersing them as it saw fit, reassembling visibility when required.

In btrfs, the root volume has swallowed the pool, blurring the focus further.

The creation of a subvolume in btrfs appears to create a directory:

btrfs subvolume create /bvault/foo
btrfs subvolume list /bvault

The new subvolume properties are reported by our last command:

ID 257 gen 23 top level 5 path foo

However, this directory can be pulled out, like the tentacles of a hydra, and mounted elsewhere using the subvol mount parameter:

mkdir /foo
mount -o loop,subvol=foo /home/CALDRON.BTRFS /foo
ddf bvault foo

We see the subvolume on its newly assigned mountpoint:

Filesystem 1K-blocks Used Available Use% Mounted on
/dev/loop0 52428800  3716  51887036   1% /bvault
/dev/loop0 52428800  3716  51887036   1% /foo

Moving the mount point is straightforward:

umount /foo
rmdir /foo
mkdir /etc/foo
mount -o loop,subvol=foo /home/CALDRON.BTRFS /etc/foo
ddf bvault foo

Quicker than boiled asparagus, the subvolume is relocated.

Filesystem 1K-blocks Used Available Use% Mounted on
/dev/loop0 52428800  3716  51887036   1% /bvault
/dev/loop0 52428800  3716  51887036   1% /etc/foo

When an OS has been installed on a ZFS dataset with a past snapshot, a rollback event can be triggered on the root dataset/filesystem to revert it, perhaps undoing a failed patch session. Under btrfs, booting a snapshot subvolume to the root, then renaming the damaged root as another subvolume (or simply deleting it), accomplishes the same effect upon a reboot. SUSE provides tools for this activity.

Although I have refrained from discussing native btrfs that existed in previous CentOS-centric Linux, Oracle Linux 7 does offer fully-supported btrfs for the root filesystem (/boot remains on XFS, restricting patch rollbacks to non-kernel packages). In the default installation, an attempt to query the root subvolumes:

btrfs subvolume list /

will produce the following list:

ID 257 gen 44412 top level 5 path root
ID 258 gen 44412 top level 5 path home
ID 261 gen 44409 top level 257 path var/lib/machines

Loopback users of btrfs are unlikely to have extensive needs for complicated subvolume mounts, but the perspective is unique. It is also pleasant to return to fstab, rather than maintaining mounts with a custom tool as ZFS does.
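
For a loopback subvolume, the return to fstab might look like the following sketch. The helper name fstab_entry is hypothetical, the paths are the ones used in this article, and the entry assumes that /home (which holds the backing file) is mounted before the line is processed.

```shell
# Hypothetical helper: print an fstab line that loop-mounts one subvolume.
fstab_entry() {
    image="$1"; mountpoint="$2"; subvol="$3"
    # Fields: device, mount point, type, options, dump flag, fsck pass.
    printf '%s %s btrfs loop,subvol=%s 0 0\n' "$image" "$mountpoint" "$subvol"
}

# Preview the line for this article's example; append it to /etc/fstab
# by hand (as root) once it looks right.
fstab_entry /home/CALDRON.BTRFS /etc/foo foo
```

Printing the line rather than writing it directly keeps the sketch harmless to run, and leaves the edit of /etc/fstab as a deliberate step.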

Defragmentation

One greatly lamented omission in ZFS is a method of defragmentation, which btrfs provides.

In a loopback mount configuration on spinning media, the host filesystem should likely be defragmented prior to defragmenting the contents of the "cauldron." With XFS, the defrag tool can report the inodes of files reorganized when run in verbose mode. If desired, an XFS defrag action on the cauldron's backing store can be confirmed by first recording the cauldron's inode:

ls -i /home/CALDRON.BTRFS

This will report the inode for the backing store:

1304021 /home/CALDRON.BTRFS

On a system with spinning media (hard drives, not Solid State Drives/SSDs), run the XFS "filesystem reorganizer," and look for the target inode in /home:

xfs_fsr -v

If your system is a mix of spinning and SSD media, only specify the non-SSD filesystems to xfs_fsr.
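
The inode check can be folded into a single step, sketched below. The helper name fsr_watch_file is hypothetical, and the sketch assumes that the verbose output of xfs_fsr names each file as "ino=" followed by the inode number; if your version prints something different, adjust the pattern.

```shell
# Hypothetical helper: run xfs_fsr verbosely on one XFS mount point and keep
# only the lines that mention the inode of a given file.
# Usage (as root): fsr_watch_file /home /home/CALDRON.BTRFS
fsr_watch_file() {
    mountpoint="$1"; file="$2"
    # Same inode lookup as "ls -i" above, keeping only the number.
    inode="$(ls -i "$file" | awk '{print $1}')"
    # -w keeps ino=123 from also matching ino=1234.
    xfs_fsr -v "$mountpoint" 2>&1 | grep -w "ino=$inode"
}
```

Naming the mount point explicitly also satisfies the advice above about keeping xfs_fsr away from SSD-backed filesystems.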

With the cauldron consistent, trigger defragmentation on each subvolume within it (a recursive defragment does not descend into nested subvolumes, so each is named separately):

btrfs filesystem defragment -rv /bvault
btrfs filesystem defragment -rv /etc/foo

Organizational problems are also sometimes remedied with btrfs-balance, which, when run without filters, rewrites the entire filesystem and guarantees that all blocks within btrfs will be reallocated. Balancing is usually performed when adding new drives, but can be useful for improving organization on a single device.
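
A full balance can run for a long time, so a filtered balance is often tried first. The sketch below rewrites only data block groups that are no more than a given percentage full, and prints the allocation before and after. The helper name balance_light and the default threshold of 50 are illustrative assumptions.

```shell
# Hypothetical helper: compact lightly used data block groups on one filesystem.
# Usage (as root): balance_light /bvault 50
balance_light() {
    fs="$1"
    pct="${2:-50}"   # usage threshold in percent; 50 is an arbitrary example

    btrfs filesystem df "$fs" &&                  # allocation before
    btrfs balance start -dusage="$pct" "$fs" &&   # rewrite data block groups at or below pct
    btrfs filesystem df "$fs"                     # allocation after
}
```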

SSD Devices

Flash storage presents special concerns for btrfs, as flash media lacks the longevity of conventional hard drives. Flash comes in two broad grades, distinguished by how many bits each MOS "floating gate" transistor stores: Single-Level Cells (SLC), and Multi-Level Cells (MLC). Commercial grade flash has traditionally been SLC media, and is generally acknowledged viable for roughly 100,000 writes per cell before risk of degradation and data loss. MLC media, commonly implemented as removable USB flash, is only viable for roughly 5,000 writes before media end of life. SLC media trades data density for longevity; MLC conversely trades longevity for density.

Flash media can be "healed" by heat in excess of 200 degrees Fahrenheit for an extended period (the higher the temperature, the less time required). Unfortunately, no commonly available flash devices exploit this property, in which an end of life cell can be rejuvenated.

Storage controllers are embedded in all flash media, which implement "wear-leveling algorithms" that migrate data from hot files to rarely-written cells, in an effort to present consistent lifetime for the whole of the storage device. The storage controllers are usually based on embedded ARM or Intel CPUs, and they are not particularly secure.

Btrfs has options that alter write patterns to flash media in an effort to increase longevity. These options are implemented as mount-time flags: ssd and ssd_spread.

"[The ssd mount] optimizations make use of the absence of the seek penalty that’s inherent for the rotational devices. The blocks can be typically written faster and are not offloaded to separate threads... The ssd_spread mount option attempts to allocate into bigger and aligned chunks of unused space, and may perform better on low-end SSDs. ssd_spread implies ssd, enabling all other SSD heuristics as well."

SSD mount options for btrfs have not always been safe for longevity: "Using the ssd mount option with older kernels than 4.14 has a negative impact on usability and lifetime of modern SSDs. This is fixed in 4.14, see this commit for more information... With [Linux kernel] 4.14+ it is safe and recommended again to use the ssd mount option for non-rotational storage."

Btrfs filesystems mounted on loopback devices might not properly detect that the host XFS filesystem resides on flash/SSD. A test with an hpsa controller presenting mirrored MO0200FCTRN solid-state drives did show that ssd was automatically added to the loopback mount options. If your hardware is not properly detected, adding such mount options explicitly might improve performance and extend longevity.
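
Whether detection succeeded can be checked from the active mount options. The sketch below is a small test for the ssd or ssd_spread flag in an option string; the helper name has_ssd_option is hypothetical, and loop0 and /bvault in the usage comments are simply the names seen in this article's output.

```shell
# Hypothetical helper: succeed if a comma-separated mount option list
# contains the ssd or ssd_spread option (nossd does not count).
has_ssd_option() {
    case ",$1," in
        *,ssd,*|*,ssd_spread,*) return 0 ;;
        *)                      return 1 ;;
    esac
}

# Usage:
#   cat /sys/block/loop0/queue/rotational     # 0 means the kernel sees non-rotational media
#   has_ssd_option "$(findmnt -no OPTIONS /bvault)" && echo "ssd heuristics active"
```

If the test fails on flash-backed storage, ssd can be added to the -o list of the mount commands shown earlier.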

Send/Receive

Btrfs compression settings (and other metadata) are not preserved on files or directories when they are moved or copied outside of a btrfs filesystem. Most backup tools (e.g. tar, rsync) cannot retain these settings. To prevent the loss of this btrfs-property metadata, btrfs offers Send/Receive (as does ZFS for its own datasets) as a method of metadata preservation. Furthermore, if a migration from CRC32C checksums to a more advanced hash (e.g. XXHASH, SHA256, BLAKE2) with full metadata preservation is desired, Send/Receive is the only tool for the job.

The ultimate destination of a Send operation must be a btrfs filesystem capable of a Receive of this raw data.

The btrfs send operation must be called on a read-only subvolume, likely a snapshot created for the explicit purpose of (meta)data movement. This is illustrated below; because the destination is formatted with SHA256 checksums, the example requires a kernel (Linux 5.5 or later) and btrfs-progs capable of the advanced btrfs checksums:

fallocate -l 50G /home/GOBLET.BTRFS
mkfs.btrfs --csum=sha256 /home/GOBLET.BTRFS
mkdir /goblet
mount -o loop /home/GOBLET.BTRFS /goblet

btrfs subvolume snapshot -r /bvault /bvault/onmyway

btrfs send /bvault/onmyway | btrfs receive /goblet

btrfs subvolume snapshot -r /bvault/foo /bvault/offwego

btrfs send /bvault/offwego | btrfs receive /goblet

These Send/Receive operations will preserve btrfs metadata, including compression settings. No other tool can preserve these file attributes. These tools also have options for incremental backups, expressed as differences between snapshots. Send should likely be part of a complete btrfs backup strategy.
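
An incremental transfer might be sketched as follows. It assumes that an earlier read-only snapshot (the parent) has already been sent, so that a copy of it exists on the receiving filesystem. The helper name send_incremental and the snapshot name onmyway2 are hypothetical.

```shell
# Hypothetical helper: snapshot a subvolume read-only, then send only the
# changes made since an earlier snapshot that the receiver already holds.
# Usage (as root): send_incremental /bvault /bvault/onmyway /bvault/onmyway2 /goblet
send_incremental() {
    src="$1"      # subvolume to back up
    parent="$2"   # earlier read-only snapshot, already received at dest
    new="$3"      # path for the new read-only snapshot
    dest="$4"     # mount point of the receiving btrfs filesystem

    btrfs subvolume snapshot -r "$src" "$new" &&
    btrfs send -p "$parent" "$new" | btrfs receive "$dest"
}
```

Only the differences between the two snapshots cross the pipe, so repeated runs with a rolling parent can keep the goblet current at modest cost.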

Scrub

Every block written by btrfs, be it data, system, or metadata, is checksummed. These on-storage checksums can be walked as a tree, allowing the entire filesystem to be verified as correct.

ZFS, with its focus on redundant sources of data, can silently repair bad blocks if a redundant form is found that is correct. With the btrfs loopback method presented here, data is not redundant, and btrfs can only report and flag data that fails checksum. In the ZFS lingo, loopback btrfs means "hating our data."

It is common to apply firmware updates to storage controllers, physical hard drives, SSD drives, motherboard firmware, and CPU microcode. For example, the HP "Service Pack for ProLiant" does exactly this when issued every quarter, as do packaged hardware updates for the server platforms of many competitors. There are ample opportunities for mistakes in these avalanches of firmware, and checksum/scrub validations can detect these vendor firmware failures nearly as soon as they occur.

Apart from this, "bitrot happens." Data on magnetic disks goes bad with time, due to many factors. Detecting bad data is better than ignoring it.

In any case, scrub operations should be performed on btrfs loopback devices to assure integrity. The basic syntax for triggering a scrub validation of all block checksums in a btrfs filesystem is:

btrfs scrub start /bvault

The status of a lengthy scrub can be queried:

btrfs scrub status /bvault

Even without redundant data, this should be performed.
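
For unattended use (from cron, for instance), a scrub can be run in the foreground and followed by the error counters that btrfs keeps for each device. This is a minimal sketch; the helper name scrub_and_report is hypothetical.

```shell
# Hypothetical helper: scrub in the foreground, then print the per-device
# error counters. Returns the exit status of the scrub itself.
# Usage (as root): scrub_and_report /bvault
scrub_and_report() {
    fs="$1"
    rc=0
    btrfs scrub start -B "$fs" || rc=$?   # -B: stay in the foreground until finished
    btrfs device stats "$fs"              # counters for I/O, corruption and generation errors
    return "$rc"
}
```

A non-zero return can then be used to raise an alert, since on a loopback filesystem without redundancy a failed checksum can only be reported, not repaired.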

Conclusion

Many "rough edges" have been uncovered above in btrfs capabilities and implementations, especially in the measures taken to enable it for CentOS. Still, even setting aside all the other desirable btrfs features, this is far better than ext2/3/4 and XFS, in that errors can be known because all filesystem content is checksummed.

It would be helpful if the developers of btrfs and ZFS could work together to create a single kernel module, with maximal sharing of "cleanroom" code, that implemented both filesystems. Code purges have happened before, in recent memory; BSD famously expunged AT&T source from the BSD "UNIX" implementation to produce Net/2. A similar effort can be organized to end a single corporation's control of filesystem source.

Oracle is itself unwilling to settle these questions with either a GPL or BSD license release of ZFS. Oracle also delivers a btrfs implementation that is lacking in features, with inapplicable documentation, and out-of-date support tools (for CentOS 8 conversion). Oracle is the impediment, and a community effort to purge ZFS source of Oracle's contributions and unify it with btrfs seems the most straightforward option.

IBM, as the new owner of Red Hat, is uniquely positioned to deploy filesystem engineers to clean-room replicate all required Oracle code in a GPL-licensed kernel module implementing both btrfs and ZFS. Should such a filesystem alternately be released under a BSD license, then many independent operating systems might enjoy advanced filesystem features (imagine a Send/Receive between z/OS and Haiku).

It would also be helpful if other parties refrained from new filesystem efforts that lack the extensive btrfs functionality and feature set (e.g. Microsoft ReFS).

Until the day that an advanced filesystem becomes as ubiquitous a commodity as Linux is as an OS, the user community will continue to be torn between questionable support, lack of features, and workarounds in a fragmented btrfs community. This is an uncomfortable place to be, and we would do well to remember the parties responsible for keeping us here.

Charles Fisher has an electrical engineering degree from the University of Iowa and works as a systems and database administrator for a Fortune 500 mining and manufacturing corporation.
