Data Deduplication with Linux
Lessfs offers a flexible solution to utilize data deduplication on affordable commodity hardware.
In recent years, the storage industry has been busy providing some of the most advanced features to its customers, including data deduplication. Data deduplication is a unique data compression technique used to eliminate redundant data and decrease the total capacities consumed on an enabled storage volume. A volume can refer to a disk device, a partition or a grouped set of disk devices all represented as single device. During the process of deduplication, redundant data is deleted, leaving a single copy of the data to be stored on the storage volume.
One ideal use-case scenario is when multiple copies of a large e-mail message are distributed and stored on a mail server. An e-mail message the size of just a couple megabytes does not seem too bad, but if it were sent and forwarded to more than 100 recipients—that's more than 200MB of copies of the same file(s).
Another great example is in the arena of host virtualization. In recent years, virtualization has been the hottest trend in server administration. If you are deploying multiple virtual guests across a network that may share the same common operating system image, data deduplication significantly can reduce the total size of capacity consumed to a single copy and, in turn, reference the differences when and where needed.
Again, the primary focus of this technology is to identify large sections of data that can include entire files or large sections of files, which are identical, and store only one copy of it. Other benefits include reduced costs for additional capacities of storage equipment, which, in turn, can be used to increase volume sizes or protect large numbers of existing volumes (such as RAID, archivals and so on). Using less storage equipment also leads to a reduced cost in energy, space and cooling.
Two types of data deduplication exist: post-process and inline deduplication. Each has its advantages and disadvantages. To summarize, post-process deduplication occurs after the data has been written to the storage volume in a separate process. While you are not losing performance in computing the necessary deduplication, multiple copies of a single file will be written multiple times, until post-process deduplication has completed, and this may become problematic if the available capacity becomes low. During inline deduplication, less storage is required, because all deduplication is handled in real time as the data is written to the storage volume, although you will notice a degradation in performance as the process attempts to identify redundant copies of the data coming in.
Storage technology manufacturers have been providing the technology as part of their proprietary and external storage solutions, but with Linux, it also is possible to use the same technology on commodity and very affordable hardware. The solutions provided by these storage technology manufacturers are in some cases available only on the physical device level (that is, the block level) and are able to work only with redundant streams of data blocks as opposed to individual files, because the logic is unable to recognize separate files over the most commonly used protocols, such as SCSI, Serial Attached SCSI (SAS), Fibre Channel, InfiniBand and even Serial ATA (SATA). This is referred to as a chunking method. The filesystem I cover here is Lessfs, a block-level-based deduplication and FUSE-enabled Linux filesystem.
FUSE or Filesystem in USEr Space is a kernel module commonly seen on UNIX-like operating systems, which provides the ability for users to create their own filesystems without touching kernel code. It is designed to run filesystem code in user space while the FUSE module acts as a bridge for communication to the kernel interfaces.
In order to use these filesystems, it is required to install FUSE on the system. Most mainstream Linux distributions, such as Ubuntu and Fedora, most likely will have the module and userland tools already preinstalled, most likely to support the ntfs-3g filesystem.
Lessfs is a high-performance inline data deduplication filesystem written for Linux and is currently licensed under the GNU General Public License version 3. It also supports LZO, QuickLZ and BZip compression (among a couple others), and data encryption. At the time of this writing, the latest stable version is 184.108.40.206, which can be downloaded from the SourceForge project page: http://sourceforge.net/projects/lessfs/files/lessfs.
Before installing the lessfs package, make sure you install all known dependencies for it. Some, if not most, of these dependencies may be available in your distribution's package repositories. You will need to install a few manually though, including mhash, tokyocabinet and fuse (if not already installed).
Your distribution may have the libraries for mhash2 either available or installed, but lessfs still requires mhash. This also can be downloaded from SourceForge: http://sourceforge.net/projects/mhash/files/mhash. At the time of this writing, the latest stable build is 0.9.9.9. Download, build and install the package:
$ tar xvzf mhash-0.9.9.9.tar.gz $ cd mhash-0.9.9.9/ $ ./configure $ make $ sudo make install
Lessfs also requires tokyocabinet, as it is the main database on which it relies. The latest stable build is 1-4.47. To build tokyocabinet, you need to have zlib1g-dev and libbz2-dev already installed, which usually are provided by most, if not all, mainstream Linux distributions.
Download, build and install the
package using the same
sudo make install commands
from earlier. On 32-bit systems, you need to append
to the configure command. Failure to use
--enable-off64 limits the
databases to a 2GB file size.
If it is not already installed or if you want to use the latest
and greatest stable build of FUSE, download it from
SourceForge: http://sourceforge.net/projects/fuse. At
the time of this writing, the latest stable build is 2.8.5. Download, build
and install the package using the same
install commands from earlier.
Petros Koutoupis is a full-time Linux kernel, device-driver and application developer for embedded and server platforms. He has been working in the data storage industry for more than six years and enjoys discussing the same technologies.