FS-Cache and FUSE for Media Playback QoS

by Ben Martin

The FS-Cache Project works with network filesystems like NFS to maintain a local on-disk cache of network files. The project is split into a kernel module (fscache) and a dæmon (cachefilesd), which together maintain the disk cache. The local on-disk cache is kept under a directory on a local filesystem, for example, the /var/fscache directory on the ext3 filesystem /var. The filesystem containing the fscache directory must support Extended Attributes (EAs). Such filesystems are quite common and include ext3 and xfs.
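If you are unsure whether the filesystem that will hold the cache directory has extended attributes enabled, a quick test such as the following can confirm it. This sketch assumes the attr tools (setfattr and getfattr) are installed and that /var is its own ext3 mount; the remount with user_xattr is needed only if that option is not already present in /etc/fstab:

# mount -o remount,user_xattr /var
# touch /var/ea-test
# setfattr -n user.fscache-test -v ok /var/ea-test
# getfattr -n user.fscache-test /var/ea-test
# rm /var/ea-test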

Early Fedora Core 6 kernel RPMs contained the fscache kernel module. Unfortunately, around version 2.6.18-1.2868.fc6 of the updated kernels, the module was no longer included. Fedora 7 kernels do not include the kernel module. Hopefully in the future, this module will be available again in standard Fedora kernels. The Fedora Core 6 update kernel 2.6.20-1.2948.fc6 has an FS-Cache patch included, but it does not include the kernel module.

Patches are available for the Linux kernel for the FS-Cache kernel module (see Resources).

The cachefilesd dæmon communicates with the kernel module using either a file in /proc (/proc/fs/cachefiles) or a device file (/dev/cachefiles). Version 0.7 and earlier versions of cachefilesd could communicate only via the proc file; version 0.8 also can use the device file if it is available, falling back to the proc file otherwise.

Setting Up cachefilesd

For Fedora Core 6 and Fedora 7, there is a cachefilesd RPM. Installation without package management should also be fairly easy, as the dæmon consists mainly of a single executable and a configuration file (/etc/cachefilesd.conf).

The two main things to set up in the configuration file are the directory under which to store the filesystem cache and the options controlling how much space the cache is allowed to consume on the filesystem containing that directory. You also can supply a tag for the cache if you want to have multiple local disk caches operating at the same time.

The space constraints all have acceptable defaults, so the cache directory is the only configuration option you need to pay attention to. Make sure that this directory is acceptable for storing caches and that it exists prior to trying to start cachefilesd. For a media PC, using a directory on a Flash memory card or on a RAM disk is a good option.

Because the cache directory must support extended attributes, and your tmpfs might not, you may have to create an ext3 filesystem in a single file inside your tmpfs filesystem and then use the embedded ext3 filesystem for the cachefilesd path. The ext3 filesystem inside the single file will happily support extended attributes. Because the whole ext3 filesystem is in a single file on a RAM disk, it will not cause distracting disk IO on the media PC.

The fstab entry in Listing 1 sets up both a 64MB RAM-based tmpfs filesystem and the mountpoint for the embedded ext3 filesystem. The commands shown in Listing 2 set up the embedded ext3 filesystem. As the cache.ext3fs filesystem exists only in RAM, you have to add these commands to your /etc/rc.local or a suitable boot-time script to set up the cache directory after a reboot. This script has to run before cachefilesd is started. Leaving cachefilesd out of your standard init run-level startups and starting it manually from rc.local, just after you set up the cache.ext3fs embedded filesystem, is a good solution.

If the cache directory is on a persistent filesystem, such as /var, set cachefilesd to start automatically, as shown in Listing 3.

Listing 1. Using a RAM Disk to Store the Local fscache On-Disk Cache

tmpfs /var/fscache tmpfs size=64m,user,user_xattr   0 0
/var/fscache/cache.ext3fs /var/fscache/cache ext3 loop=/dev/loop1,user_xattr,noauto 0 0

Listing 2. Setting Up the Embedded ext3 Filesystem

# mount /var/fscache
# cd /var/fscache
# dd if=/dev/zero of=cache.ext3fs \
      bs=1024 count=65536
# mkfs.ext3 -F cache.ext3fs 
# mount cache.ext3fs

Listing 3. Starting the cachefilesd Dæmon and Setting It to Auto-Start Next Boot

$ su -l
# service cachefilesd start
# chkconfig cachefilesd on

The space constraints in the configuration file set what percentage of the available blocks and files on the filesystem containing the local cache directory may be used. For each of these two resource types, there are three thresholds: cull-off, cull-start and cache-off. As long as free space stays above the cull-off limit, no culling of the disk cache is performed; when free space drops to the cull-start limit, culling of the disk cache begins. For example, for the disk block constraint, setting cull-off at 20% and cull-start at 10% means that as long as the disk has more than 20% free blocks, nothing from the cache will be culled. Once the disk reaches 10% free blocks, cache culling begins to free up some space. If the disk manages to fall to the cache-off limit (say, 5%), the cache will be disabled until free space rises above that limit again.

The configuration options are prefixed with b for the block constraints and f for the files-available constraints. The configuration file uses slightly different names from those above: for the block constraints, the cull-off limit is called brun, the cull-start limit is called bcull, and the cache-off limit is called bstop.
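Tying the two naming schemes together, a minimal /etc/cachefilesd.conf using the example percentages above might look like the following sketch. The dir path and tag are illustrative only; use /var/fscache/cache instead if you follow the embedded ext3 setup of Listings 1 and 2, and note that the f* options take the same percentage syntax as their b* counterparts:

dir /var/fscache
tag mycache
brun 20%
bcull 10%
bstop 5%
frun 20%
fcull 10%
fstop 5%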

Modifying Mounts

To turn on FS-Cache for a mountpoint, you have to pass it the fsc mount option. I noticed that I had to enable FS-Cache for all mountpoints for a given NFS server, or FS-Cache would not maintain its cache. This should not be much of an issue for a machine being used as a media PC, because you likely will not mind having all NFS mounts from the file server being cached.

The fstab entry shown in Listing 4 includes the fsc option. Adding this fsc option to all mountpoint references to fileserver:/... will enable FS-Cache.

Listing 4. fstab Entry for Mounting an NFS Directory on the Fileserver with FS-Cache

fileserver:/foo  /foo  nfs bg,intr,soft,fsc  0 0

Preemptive Caching

At this stage, FS-Cache will store a local cached copy of files, or parts thereof, that are read off the file server. What we really want is for data from the files we are viewing on the media PC to be read ahead into the local disk cache.

To get information into the local disk cache, we can use a FUSE module as a shim between the NFS mountpoint and the application viewing the media. With FUSE, you can write a filesystem as an application in the user address space and access it through the Linux kernel just like any other filesystem. To keep things simple, I refer to the application that provides a FUSE filesystem simply as a FUSE module.

The FUSE filesystem should take the path to the NFS filesystem we want to cache (the delegate) and a mountpoint where the FUSE filesystem is exposed by the kernel. For example, if we have a /HomeMovies NFS mount where we store all our digital home movies, the FUSE module might be mounted on /CacheHomeMovies and will take the path /HomeMovies as the delegate path.

When /CacheHomeMovies is read, the FUSE module will read the delegate (/HomeMovies) and show the same directory contents. When the file /CacheHomeMovies/venice-2001.dv is read, the FUSE module reads the information from /HomeMovies/venice-2001.dv and returns that. Effectively, /CacheHomeMovies will appear just the same as /HomeMovies to an application.

At this stage, we have not gained anything over using /HomeMovies directly. However, in the read(2) implementation of the FUSE module, we could just as easily ask the delegate (/HomeMovies) to read in what the application requested and the next 4MB of data. The FUSE module could just throw away that extra information. The mere act of the FUSE module reading the 4MB of data will trigger FS-Cache to read it over the network and store it in the local disk cache.

The main advantage of using FUSE is that it allows caching to keep working properly when the application seeks within the video during playback. The main disadvantage is the extra local copying, where the FUSE module asks for more information than is returned to the video player. This can be mitigated by having the FUSE module request the extra information only every now and then, for example, reading ahead only after 2MB of data has been consumed by the application.

For optimal performance, the read-ahead should be issued either from a separate thread of control in the FUSE module using readahead(2), or with asynchronous IO, so that the video playback application is not blocked waiting for a large read-ahead request to complete.
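The shim described in the next section takes the asynchronous IO route. For comparison, the separate-thread alternative could look roughly like the following stand-alone sketch. It is not part of the article's shim; the file argument, the 8MB window and the readahead_in_background helper are illustrative assumptions, and it would be built with something like g++ -std=c++11 -pthread readahead-sketch.cpp.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE          // readahead(2) is a GNU extension
#endif
#include <fcntl.h>           // open(), readahead()
#include <unistd.h>          // close(), sleep()
#include <iostream>
#include <thread>
using namespace std;

// Pull 'window' bytes starting at 'offset' into the page cache from a
// background thread so the caller never blocks. Reading through the
// NFS client this way also lets FS-Cache fill its local disk cache.
static void readahead_in_background( int fd, off64_t offset )
{
    const size_t window = 8 * 1024 * 1024;   // 8MB, as in the shim
    thread( [=]() {
        if( readahead( fd, offset, window ) < 0 )
            cerr << "readahead() failed at offset:" << offset << endl;
    } ).detach();
}

int main( int argc, char** argv )
{
    if( argc < 2 )
    {
        cerr << "usage: readahead-sketch <file>" << endl;
        return 1;
    }
    int fd = open( argv[1], O_RDONLY );
    if( fd < 0 )
    {
        cerr << "cannot open:" << argv[1] << endl;
        return 1;
    }
    readahead_in_background( fd, 0 );   // prime the first 8MB
    // ... a real shim would keep serving fs_read() calls here ...
    sleep( 2 );                         // give the worker time to finish
    close( fd );
    return 0;
}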

The FUSE Shim

The fuselagefs package is a C++ wrapper for FUSE. It includes the Delegatefs superclass, which provides support for implementing FUSE modules that take a delegate filesystem and add some additional functionality on top of it. Delegatefs is a perfect starting point for writing simple shim filesystems like the read-ahead FUSE module described above.

The read-ahead algorithm is designed to read 8MB using asynchronous IO, and when the first 4MB of that is shown to the application using the FUSE filesystem, it then reads another 8MB using asynchronous IO. So there should be, at worst, 4MB of cached data always available to the FUSE module.

The C++ class to implement the shim is about 70 lines of code, as shown in Listing 5. Two offsets are declared to keep track of what the file offset was in the previous call to fs_read() and at what offset the next asynchronous read-ahead call should be launched. The aio_buffer_sz constant is declared as an enum so it can be used to declare the size of aio_buffer. When aio_consume_window bytes of aio_buffer have been shown to the application using the FUSE filesystem, another read-ahead is performed. If debug_readahread_aio is true, the FUSE module explicitly waits for the asynchronous read-ahead to finish before returning, which is handy when debugging to ensure that the return value of the asynchronous IO is valid. A production implementation would instead have some callback report whether an asynchronous IO operation has failed.

The main job of schedule_readahread_aio() is to execute, when appropriate, a single asynchronous read-ahead call. It updates m_startNextAIOOffset to tell itself when the next asynchronous read-ahead call should be made. The forceNewReadAHead parameter allows the caller to force a new asynchronous read-ahead, for cases such as when a seek has been performed.

The fs_read() method is a virtual method from Delegatefs with semantics similar to the pread(2) system call: data should be read into a buffer of a given size at a nominated offset. It is called indirectly by FUSE. The main logic of our fs_read() is to check whether the given offset follows sequentially from the last read call. If the offset is not sequential from the last byte returned by the previous read(), fs_read() forces schedule_readahread_aio() to perform another read-ahead. schedule_readahread_aio() is always called from fs_read(), so it can maintain the sliding asynchronous read-ahead window.

Because Delegatefs knows how to read bytes from the delegate filesystem, we can then simply return by calling up to the base class. The remainder of nfs-fuse-readahead-shim.cpp is taken up by parsing command-line options; instead of returning from main() directly, it calls the main method of a Delegatefs through an instance of the CustomFilesystem class. The shim is compiled with the Makefile shown in Listing 6.

Listing 5. Entire FUSE Shim C++ Class


#include <fuselagefs/fuselagefs.hh>
using namespace Fuselage;
using namespace Fuselage::Helpers;

#include <aio.h>
#include <errno.h>

#include <string>
#include <iostream>
using namespace std;
...
class CustomFilesystem
 :
 public Delegatefs
{
 typedef Delegatefs _Base;
 off_t m_oldOffset;
 off_t m_startNextAIOOffset;
 enum
 {
   aio_buffer_sz = 8 * 1024 * 1024,
   aio_consume_window = aio_buffer_sz / 2,
   debug_readahread_aio = false
 };
 char aio_buffer[ aio_buffer_sz ];
    
 void schedule_readahread_aio( int fd, 
     off_t offset, bool forceNewReadAHead )
 {
   if( m_startNextAIOOffset <= offset 
        || forceNewReadAHead )
   {
     cerr << "Starting an async read request"
          << " at offset:" << offset << endl;

      ssize_t retval;
      // The aiocb must remain valid until the request completes,
      // so it cannot be a stack local that goes out of scope here.
      static struct aiocb arg;
      bzero( &arg, sizeof (struct aiocb));
     arg.aio_fildes = fd;
     arg.aio_offset = offset; 
     arg.aio_buf = (void *) aio_buffer; 
     arg.aio_nbytes = aio_buffer_sz; 
     arg.aio_sigevent.sigev_notify = SIGEV_NONE; 
 
     retval = aio_read( &arg );
     if( retval < 0 )
       cerr << "error starting aio request!" 
            << endl;
 
     m_startNextAIOOffset = offset 
        + aio_consume_window;

     if( debug_readahread_aio )
     {
       while ( (retval = aio_error( &arg ) ) 
           == EINPROGRESS )
       {}
       cerr << "aio_return():" 
            << aio_return( &arg ) 
             << endl;
      }
    }
 }
    
public:

 CustomFilesystem()
 :
 _Base(),
 m_startNextAIOOffset( 0 ),
 m_oldOffset( -1 )
 {
 }
    
 virtual int fs_read( const char *path, 
    char *buf, size_t size,
    off_t offset, struct fuse_file_info *fi)
 {
   cerr << "fs_read() offset:" << offset
        << " sz:" << size << endl;
   int fd = fi->fh;

   bool forceNewReadAHead = false;
   if( (offset - size) != m_oldOffset )
   {
     cerr << "possible seek() between read()s!" 
          << endl;
     forceNewReadAHead = true;
     aio_cancel( fd, 0 );
   }
   schedule_readahread_aio( fd, offset, 
                            forceNewReadAHead );
   m_oldOffset = offset;
   return _Base::fs_read( path, buf, 
                          size, offset, fi );
 }
};

Listing 6. Makefile for the FUSE Shim

nfs-fuse-readahead-shim: nfs-fuse-readahead-shim.cpp
	g++ nfs-fuse-readahead-shim.cpp \
          -o nfs-fuse-readahead-shim \
          -D_FILE_OFFSET_BITS=64 -lfuselagefs

Taking It for a Spin

A simple application that reads from a given file at a predetermined rate can be used to verify that the cache is being populated as expected; it is shown in Listing 7. There isn't a great deal of error checking going on, but things that would cause grief, such as failed read()s, are reported to the console. The application repeatedly reads 4KB chunks from a nominated file and throws away the result. Every 256KB, the current offset is reported, so that the application can be closed knowing roughly which byte of the file was last read.

Listing 7. simpleread.cpp Reads from argv[1] at a Nominated usec Rate in argv[2]


#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>     // read(), usleep()
#include <cstdlib>      // exit()

#include <iostream>
#include <sstream>
using namespace std;

int main( int argc, char** argv )
{
    cerr << "opening argv[1]:" << argv[1] << endl;
    
    long offset = 0;
    int fd = open( argv[1], O_RDONLY );

    unsigned long usec = 10000;
    if( argc > 2 )
    {
        stringstream ss;
        ss << argv[2];
        ss >> usec;
    }
    cerr << "using delay of usec:" << usec << endl;
    
    const int bufsz = 4096;
    char buf[ bufsz ];
    bool error = false;
    
    while( true )
    {
        ssize_t rc = read( fd, buf, bufsz );
        if( rc > 0 )
        {
            if( error )
            {
                cerr << "reading resumed" << endl;
            }
            error = false;
            offset += rc;
        }
        else if( rc == 0 )
        {
            cerr << "end of file" << endl;
            exit(0);
        }
        else
        {
            error = true;
            cerr << "read error:" << errno 
                 << " at offset:" << offset 
                 << endl;
        }
        usleep( usec );
        if( offset % (1024*256) == 0 )
            cerr << "offset:" << offset << endl;
    }
    return 0;
}

As shown in Listing 8, we first clean out the cache directory and restart cachefilesd. Then, the NFS share is mounted and the FUSE shim is run against it, exposing it at /Cached-HomeMovies. The FUSE executable is told to remain in the foreground, which stops FUSE from running it as a dæmon and allows standard output and standard error of the FUSE filesystem to be displayed. We use bash to put nfs-fuse-readahead-shim into the background (while still redirecting its standard output and error into a capture file) and run simpleread for a little more than 500KB of data. Then, both simpleread and nfs-fuse-readahead-shim are stopped, so we can investigate whether the cache has been populated as expected.

Listing 8. Running simpleread against the FUSE Shim


# rm -rf /var/fscache/*
# /etc/init.d/cachefilesd restart
# mount fileserver:/HomeMovies /HomeMovies -o fsc
# nfs-fuse-readahead-shim --fuse-forground \
  -u /HomeMovies /Cached-HomeMovies \
  >|/tmp/nfs-fuse-out 2>&1 \
  &

# simpleread /Cached-HomeMovies/venice-2001.dv 1000
using delay of usec:1000
offset:262144
offset:524288
^C
# fg
^C
# 

The simpleread was stopped after reading only a little more than half a megabyte. However, the FUSE module issues an asynchronous IO call at the start, requesting that 8MB of data be read. Poking around in /var/fscache for a file with the same size as venice-2001.dv should reveal the cache file. Comparing the first 8MB of this cache file to the version on the NFS share should show that they are identical. Note that the local cached file is read first, to make sure that the subsequent read of the NFS share does not populate the cache file before it is compared. This is shown in Listing 9.

Listing 9. Checking That the Cache Has Read the First 8MB

# cd /var/fscache
# ll -R
...
---------- 1 root root 800M Jun 10 02:19 Ek0...000000
# dd if=./path/to/Ek0...000000 \
   of=/tmp/8mb bs=1024 count=8192
# dd if=/HomeMovies/venice-2001.dv \
   of=/tmp/8mb.real bs=1024 count=8192
# diff /tmp/8mb.real /tmp/8mb
#

Wrap-Up

One restriction on FS-Cache is that it will not cache files opened with O_DIRECT or for writing.

By taking advantage of the kernel FS-Cache code, the FUSE module that handles read-ahead can be kept very simple. The Delegatefs C++ FUSE base class makes it easy to layer additional functionality on top of the IO that applications perform.

The nfs-fuse-readahead-shim FUSE module is started just as shown in Listing 8; when the --fuse-forground option is not passed, it runs silently as a dæmon.

Ben Martin has been working on filesystems for more than ten years. He is currently working toward a PhD combining Semantic Filesystems with Formal Concept Analysis to improve human-filesystem interaction.
