Data Deduplication with Linux
Lessfs offers a flexible solution to utilize data deduplication on affordable commodity hardware.
In recent years, the storage industry has been busy providing some of the most advanced features to its customers, including data deduplication. Data deduplication is a unique data compression technique used to eliminate redundant data and decrease the total capacities consumed on an enabled storage volume. A volume can refer to a disk device, a partition or a grouped set of disk devices all represented as single device. During the process of deduplication, redundant data is deleted, leaving a single copy of the data to be stored on the storage volume.
One ideal use-case scenario is when multiple copies of a large e-mail message are distributed and stored on a mail server. An e-mail message the size of just a couple megabytes does not seem too bad, but if it were sent and forwarded to more than 100 recipients—that's more than 200MB of copies of the same file(s).
Another great example is in the arena of host virtualization. In recent years, virtualization has been the hottest trend in server administration. If you are deploying multiple virtual guests across a network that may share the same common operating system image, data deduplication significantly can reduce the total size of capacity consumed to a single copy and, in turn, reference the differences when and where needed.
Again, the primary focus of this technology is to identify large sections of data that can include entire files or large sections of files, which are identical, and store only one copy of it. Other benefits include reduced costs for additional capacities of storage equipment, which, in turn, can be used to increase volume sizes or protect large numbers of existing volumes (such as RAID, archivals and so on). Using less storage equipment also leads to a reduced cost in energy, space and cooling.
Two types of data deduplication exist: post-process and inline deduplication. Each has its advantages and disadvantages. To summarize, post-process deduplication occurs after the data has been written to the storage volume in a separate process. While you are not losing performance in computing the necessary deduplication, multiple copies of a single file will be written multiple times, until post-process deduplication has completed, and this may become problematic if the available capacity becomes low. During inline deduplication, less storage is required, because all deduplication is handled in real time as the data is written to the storage volume, although you will notice a degradation in performance as the process attempts to identify redundant copies of the data coming in.
Storage technology manufacturers have been providing the technology as part of their proprietary and external storage solutions, but with Linux, it also is possible to use the same technology on commodity and very affordable hardware. The solutions provided by these storage technology manufacturers are in some cases available only on the physical device level (that is, the block level) and are able to work only with redundant streams of data blocks as opposed to individual files, because the logic is unable to recognize separate files over the most commonly used protocols, such as SCSI, Serial Attached SCSI (SAS), Fibre Channel, InfiniBand and even Serial ATA (SATA). This is referred to as a chunking method. The filesystem I cover here is Lessfs, a block-level-based deduplication and FUSE-enabled Linux filesystem.
FUSE or Filesystem in USEr Space is a kernel module commonly seen on UNIX-like operating systems, which provides the ability for users to create their own filesystems without touching kernel code. It is designed to run filesystem code in user space while the FUSE module acts as a bridge for communication to the kernel interfaces.
In order to use these filesystems, it is required to install FUSE on the system. Most mainstream Linux distributions, such as Ubuntu and Fedora, most likely will have the module and userland tools already preinstalled, most likely to support the ntfs-3g filesystem.
Lessfs
Lessfs is a high-performance inline data deduplication filesystem written for Linux and is currently licensed under the GNU General Public License version 3. It also supports LZO, QuickLZ and BZip compression (among a couple others), and data encryption. At the time of this writing, the latest stable version is 1.3.3.1, which can be downloaded from the SourceForge project page: http://sourceforge.net/projects/lessfs/files/lessfs.
Before installing the lessfs package, make sure you install all known dependencies for it. Some, if not most, of these dependencies may be available in your distribution's package repositories. You will need to install a few manually though, including mhash, tokyocabinet and fuse (if not already installed).
Your distribution may have the libraries for mhash2 either available or installed, but lessfs still requires mhash. This also can be downloaded from SourceForge: http://sourceforge.net/projects/mhash/files/mhash. At the time of this writing, the latest stable build is 0.9.9.9. Download, build and install the package:
$ tar xvzf mhash-0.9.9.9.tar.gz
$ cd mhash-0.9.9.9/
$ ./configure
$ make
$ sudo make install
Lessfs also requires tokyocabinet, as it is the main database on which it relies. The latest stable build is 1-4.47. To build tokyocabinet, you need to have zlib1g-dev and libbz2-dev already installed, which usually are provided by most, if not all, mainstream Linux distributions.
Download, build and install the
package using the same configure,
make and sudo make install commands
from earlier. On 32-bit systems, you need to append
--enable-off64
to the configure command. Failure to use
--enable-off64 limits the
databases to a 2GB file size.
If it is not already installed or if you want to use the latest
and greatest stable build of FUSE, download it from
SourceForge: http://sourceforge.net/projects/fuse. At
the time of this writing, the latest stable build is 2.8.5. Download, build
and install the package using the same configure,
make and sudo make
install commands from earlier.
Petros Koutoupis is a full-time Linux kernel, device-driver and application developer for embedded and server platforms. He has been working in the data storage industry for more than six years and enjoys discussing the same technologies.
Trending Topics
| OpenLDAP Everywhere Reloaded, Part I | May 23, 2012 |
| Chemistry the Gromacs Way | May 21, 2012 |
| Make TV Awesome with Bluecop | May 16, 2012 |
| Hack and / - Password Cracking with GPUs, Part I: the Setup | May 15, 2012 |
| An Introduction to Application Development with Catalyst and Perl | May 14, 2012 |
| Cryptocurrency: Your Total Cost Is 01001010010 | May 09, 2012 |
- OpenLDAP Everywhere Reloaded, Part I
- Strip DRM from WMV File
- Validate an E-Mail Address with PHP, the Right Way
- Boot with GRUB
- Why Python?
- A Statistical Approach to the Spam Problem
- Chapter 16: Ubuntu and Your iPod
- Why Hulu Plus Sucks, and Why You Should Use It Anyway
- Building an Ultra-Low-Power File Server with the Trim-Slice
- Science the GNU Way, Part I
- Editorial Standards?
4 hours 7 min ago - Great one
5 hours 42 min ago - Common form in many
6 hours 3 min ago - Awsome
11 hours 6 min ago - Euro 2012 Coupon Codes - Get 20% Off Pavtube TiVo Converter
3 days 9 hours ago - Euro 2012 Big Sale: 20% Off Instant Savings on TiVo Converter
3 days 9 hours ago - MakeMKV works as well, though
3 days 10 hours ago - Euro 2012 Big Sale: 20% Off Instant Savings on TiVo Converter
3 days 10 hours ago - Awesome
4 days 8 hours ago - Who worries approx the
4 days 10 hours ago





Comments
Enterprise, HSM type solutions?
Nice article. I work in an academic lab where we crunch massive amounts of data, and storage is always a huge headache for us. In the past we've had access to HSM storage management solutions, but the slowest tier has always been tape. It turns out that getting your data back from tape takes longer in some cases than just recomputing it, which already takes weeks on HPCs. It seems to me that if you could create HSM type solution with a fast parallel file system, like lustre, as the fastest storage tier and a compressed, deduplicated file system on slower, cheaper magnetic disks you might have a more reasonable, cost effecctive storage system for HPC. (I have not run any numbers though, an I'm not sure wahether yoou could build a system like this with OTS software/hardware.)
-Zaak
Not Linux, But take a look at SmartOS from Joyent
If you want to take advantage of de-duplication in your basement or development lab for your virtual machines you could consider using SmartOS as the underlying hypervisor platform. It comes with KVM as the hypervisor and ZFS as the filesystem. To enable de-dupe in ZFS it is simply: "zfs set dedup=on pool/filesystem", plus all the other awesome features of ZFS. Instant snapshots, clones, compression, etc. Then you can run your favorite GNU/Linux platform on top of it with de-duplication happening under the hypervisor. This ZFS de-duplication is all open-source and hails from the Illumos kernel.
Great post
This is a great post and I've often wondered how GNU/Linux gets support for deduplication at the filesystem level. Great stuff and just another example of open source at its best.
Great post
This is a great post and I've often wondered how GNU/Linux gets support for deduplication at the filesystem level. Great stuff and just another example of open source at its best.