Data Deduplication with Linux
Lessfs offers a flexible solution to utilize data deduplication on affordable commodity hardware.
In recent years, the storage industry has been busy providing some of the most advanced features to its customers, including data deduplication. Data deduplication is a unique data compression technique used to eliminate redundant data and decrease the total capacities consumed on an enabled storage volume. A volume can refer to a disk device, a partition or a grouped set of disk devices all represented as single device. During the process of deduplication, redundant data is deleted, leaving a single copy of the data to be stored on the storage volume.
One ideal use-case scenario is when multiple copies of a large e-mail message are distributed and stored on a mail server. An e-mail message the size of just a couple megabytes does not seem too bad, but if it were sent and forwarded to more than 100 recipients—that's more than 200MB of copies of the same file(s).
Another great example is in the arena of host virtualization. In recent years, virtualization has been the hottest trend in server administration. If you are deploying multiple virtual guests across a network that may share the same common operating system image, data deduplication significantly can reduce the total size of capacity consumed to a single copy and, in turn, reference the differences when and where needed.
Again, the primary focus of this technology is to identify large sections of data that can include entire files or large sections of files, which are identical, and store only one copy of it. Other benefits include reduced costs for additional capacities of storage equipment, which, in turn, can be used to increase volume sizes or protect large numbers of existing volumes (such as RAID, archivals and so on). Using less storage equipment also leads to a reduced cost in energy, space and cooling.
Two types of data deduplication exist: post-process and inline deduplication. Each has its advantages and disadvantages. To summarize, post-process deduplication occurs after the data has been written to the storage volume in a separate process. While you are not losing performance in computing the necessary deduplication, multiple copies of a single file will be written multiple times, until post-process deduplication has completed, and this may become problematic if the available capacity becomes low. During inline deduplication, less storage is required, because all deduplication is handled in real time as the data is written to the storage volume, although you will notice a degradation in performance as the process attempts to identify redundant copies of the data coming in.
Storage technology manufacturers have been providing the technology as part of their proprietary and external storage solutions, but with Linux, it also is possible to use the same technology on commodity and very affordable hardware. The solutions provided by these storage technology manufacturers are in some cases available only on the physical device level (that is, the block level) and are able to work only with redundant streams of data blocks as opposed to individual files, because the logic is unable to recognize separate files over the most commonly used protocols, such as SCSI, Serial Attached SCSI (SAS), Fibre Channel, InfiniBand and even Serial ATA (SATA). This is referred to as a chunking method. The filesystem I cover here is Lessfs, a block-level-based deduplication and FUSE-enabled Linux filesystem.
FUSE or Filesystem in USEr Space is a kernel module commonly seen on UNIX-like operating systems, which provides the ability for users to create their own filesystems without touching kernel code. It is designed to run filesystem code in user space while the FUSE module acts as a bridge for communication to the kernel interfaces.
In order to use these filesystems, it is required to install FUSE on the system. Most mainstream Linux distributions, such as Ubuntu and Fedora, most likely will have the module and userland tools already preinstalled, most likely to support the ntfs-3g filesystem.
Lessfs is a high-performance inline data deduplication filesystem written for Linux and is currently licensed under the GNU General Public License version 3. It also supports LZO, QuickLZ and BZip compression (among a couple others), and data encryption. At the time of this writing, the latest stable version is 188.8.131.52, which can be downloaded from the SourceForge project page: http://sourceforge.net/projects/lessfs/files/lessfs.
Before installing the lessfs package, make sure you install all known dependencies for it. Some, if not most, of these dependencies may be available in your distribution's package repositories. You will need to install a few manually though, including mhash, tokyocabinet and fuse (if not already installed).
Your distribution may have the libraries for mhash2 either available or installed, but lessfs still requires mhash. This also can be downloaded from SourceForge: http://sourceforge.net/projects/mhash/files/mhash. At the time of this writing, the latest stable build is 0.9.9.9. Download, build and install the package:
$ tar xvzf mhash-0.9.9.9.tar.gz $ cd mhash-0.9.9.9/ $ ./configure $ make $ sudo make install
Lessfs also requires tokyocabinet, as it is the main database on which it relies. The latest stable build is 1-4.47. To build tokyocabinet, you need to have zlib1g-dev and libbz2-dev already installed, which usually are provided by most, if not all, mainstream Linux distributions.
Download, build and install the
package using the same
sudo make install commands
from earlier. On 32-bit systems, you need to append
to the configure command. Failure to use
--enable-off64 limits the
databases to a 2GB file size.
If it is not already installed or if you want to use the latest
and greatest stable build of FUSE, download it from
SourceForge: http://sourceforge.net/projects/fuse. At
the time of this writing, the latest stable build is 2.8.5. Download, build
and install the package using the same
install commands from earlier.
Petros Koutoupis is a full-time Linux kernel, device-driver and application developer for embedded and server platforms. He has been working in the data storage industry for more than six years and enjoys discussing the same technologies.
|Using Salt Stack and Vagrant for Drupal Development||May 20, 2013|
|Making Linux and Android Get Along (It's Not as Hard as It Sounds)||May 16, 2013|
|Drupal Is a Framework: Why Everyone Needs to Understand This||May 15, 2013|
|Home, My Backup Data Center||May 13, 2013|
|Non-Linux FOSS: Seashore||May 10, 2013|
|Trying to Tame the Tablet||May 08, 2013|
- RSS Feeds
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- New Products
- Drupal Is a Framework: Why Everyone Needs to Understand This
- A Topic for Discussion - Open Source Feature-Richness?
- Home, My Backup Data Center
- Validate an E-Mail Address with PHP, the Right Way
- Tech Tip: Really Simple HTTP Server with Python
- Trying to Tame the Tablet
- New Products
- git-annex assistant
5 hours 41 min ago
- direct cable connection
6 hours 3 min ago
- Agreed on AirDroid. With my
6 hours 14 min ago
- I just learned this
6 hours 18 min ago
6 hours 48 min ago
- not living upto the mobile revolution
9 hours 39 min ago
- Deceptive Advertising and
10 hours 15 min ago
- Let\'s declare that you have
10 hours 16 min ago
- Alterations in Contest Due
10 hours 17 min ago
- At a numbers mindset, your
10 hours 18 min ago