Data Deduplication with Linux


After resolving all the more obscure dependencies, you're ready to build and install the lessfs package. Download, build and install the package using the same configure, make and sudo make install commands from earlier.

Now you're ready to go, but before you can do anything, some preparation is needed. In the lessfs source directory, there is a subdirectory called etc/, and in it is a configuration file. Copy the configuration file to the system's /etc directory path:

$ sudo cp etc/lessfs.cfg /etc/

This file defines the location of the databases among a few other details (which I discuss later in this article, but for now let's concentrate on getting the filesystem up and running). You will need to create the directory path for the file data (default is /data/dta) and also for the metadata (default is /data/mta) for all file I/O operations sent to/from the lessfs filesystem. Create the directory paths:

$ sudo mkdir -p /data/{dta,mta}

Initialize the databases in the directory paths with the mklessfs command:

$ sudo mklessfs -c /etc/lessfs.cfg

The -c option is used to specify the path and name of the configuration file. A man page does not exist for the command, but you still can invoke the on-line menu with the -h command option.

Now that the databases have been initialized, you're ready to mount a lessfs-enabled filesystem. In the following example, let's mount it to the /mnt path:

$ sudo lessfs /etc/lessfs.cfg /mnt

When mounted, the filesystem assumes the total capacity of the filesystem to which it is being mounted. In my case, it is the filesystem on /dev/sda1:

$ df -t fuse.lessfs
Filesystem        1K-blocks      Used Available Use% Mounted on
lessfs              5871080   3031812   2541028  55% /mnt

$ df -t ext4
Filesystem        1K-blocks      Used Available Use% Mounted on
/dev/sda1           5871080   3031812   2541028  55% /

Currently, you should see nothing but a hidden .lessfs subdirectory when listing the contents of the newly mounted lessfs volume:

$ ls -a /mnt/
.  ..  .lessfs

Once mounted, the lessfs volume can be unmounted like any other volume:

$ sudo umount /mnt

Let's put the volume to the test. Writing file data to a lessfs volume is no different from what it would be to any other filesystem. In the example below, I'm using the dd command to write approximately 100MB of all zeros to /mnt/test.dat:

$ sudo dd if=/dev/zero of=/mnt/test.dat bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 5.05418 s, 20.7 MB/s

Seeing how the filesystem is designed to eliminate all redundant copies of data and being that a file filled with nothing but zeros qualifies as a prime example of this, you can observe that only 48KB of capacity was consumed, and that may just be nothing more than the necessary data synchronized to the databases:

$ df -t fuse.lessfs
Filesystem        1K-blocks      Used Available Use% Mounted on
lessfs              5871080   3031860   2540980  55% /mnt

If you list a detailed listing of that same file in the lessfs-enabled directory, it appears that all 100MB have been written. Utilizing its embedded logic, lessfs reconstructs all data on the fly when additional read and write operations are initiated to the file(s):

$ ls -l
total 102400
-rw-r--r-- 1 root root 104857600 2011-02-26 13:57 test.dat

Now, let's work with something a bit more complex—something containing a lot of random data. For this example, I decided to download the latest stable release candidate of the Linux kernel source from, but before I did, I listed the total capacity consumed available on the lessfs volume as a reference point:

$ df -t fuse.lessfs
Filesystem        1K-blocks      Used Available Use% Mounted on
lessfs              5871080   3031896   2540944  55% /mnt

$ sudo wget

Listing the contents, you can see that the package is approximately 75MB:

$ ls -l linux-2.6.38-rc6.tar.bz2 
-rw-r--r-- 1 root root 74783787 2011-02-21 19:50 

Listing the capacity used to store the Linux kernel source archive yields a difference of roughly 75MB:

$ df -t fuse.lessfs
Filesystem        1K-blocks      Used Available Use% Mounted on
lessfs              5871080   3106440   2466400  56% /mnt

Now, let's create a copy of the archived kernel source:

$ sudo cp linux-2.6.38-rc6.tar.bz2 linux-2.6.38-rc6.tar.bz2-bak

$ ls -l linux-2.6.38-rc6.tar.bz2*
-rw-r--r-- 1 root root 74783787 2011-02-21 19:50 
-rw-r--r-- 1 root root 74783787 2011-02-26 14:43 

By having a redundant copy of the same file, an additional 44KB is consumed—not nearly as much as an additional 75MB:

$ df -t fuse.lessfs
Filesystem        1K-blocks      Used Available Use% Mounted on
lessfs              5871080   3106484   2466356  56% /mnt

And, because the databases contain the actual file and metadata, if an accidental or intentional system reboot occurred, or if for whatever reason you need to unmount the filesystem, the physical data will not be lost. All you need to do is invoke the same mount command and everything is restored:

$ sudo umount /mnt/
$ sudo lessfs /etc/lessfs.cfg /mnt
$ ls
linux-2.6.38-rc6.tar.bz2  linux-2.6.38-rc6.tar.bz2-bak

In the situation when a system suffers from an accidental reboot, possibly due to power loss, as of version 1.0.4, lessfs supports transactions, which eliminates the need for an fsck after a crash.


Petros Koutoupis is a full-time Linux kernel, device-driver and application developer for embedded and server platforms. He has been working in the data storage industry for more than six years and enjoys discussing the same technologies.


Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Enterprise, HSM type solutions?

Zaak's picture

Nice article. I work in an academic lab where we crunch massive amounts of data, and storage is always a huge headache for us. In the past we've had access to HSM storage management solutions, but the slowest tier has always been tape. It turns out that getting your data back from tape takes longer in some cases than just recomputing it, which already takes weeks on HPCs. It seems to me that if you could create HSM type solution with a fast parallel file system, like lustre, as the fastest storage tier and a compressed, deduplicated file system on slower, cheaper magnetic disks you might have a more reasonable, cost effecctive storage system for HPC. (I have not run any numbers though, an I'm not sure wahether yoou could build a system like this with OTS software/hardware.)


Not Linux, But take a look at SmartOS from Joyent

forq's picture

If you want to take advantage of de-duplication in your basement or development lab for your virtual machines you could consider using SmartOS as the underlying hypervisor platform. It comes with KVM as the hypervisor and ZFS as the filesystem. To enable de-dupe in ZFS it is simply: "zfs set dedup=on pool/filesystem", plus all the other awesome features of ZFS. Instant snapshots, clones, compression, etc. Then you can run your favorite GNU/Linux platform on top of it with de-duplication happening under the hypervisor. This ZFS de-duplication is all open-source and hails from the Illumos kernel.

Great post

apexwm's picture

This is a great post and I've often wondered how GNU/Linux gets support for deduplication at the filesystem level. Great stuff and just another example of open source at its best.

Great post

apexwm's picture

This is a great post and I've often wondered how GNU/Linux gets support for deduplication at the filesystem level. Great stuff and just another example of open source at its best.

White Paper
Linux Management with Red Hat Satellite: Measuring Business Impact and ROI

Linux has become a key foundation for supporting today's rapidly growing IT environments. Linux is being used to deploy business applications and databases, trading on its reputation as a low-cost operating environment. For many IT organizations, Linux is a mainstay for deploying Web servers and has evolved from handling basic file, print, and utility workloads to running mission-critical applications and databases, physically, virtually, and in the cloud. As Linux grows in importance in terms of value to the business, managing Linux environments to high standards of service quality — availability, security, and performance — becomes an essential requirement for business success.

Learn More

Sponsored by Red Hat

White Paper
Private PaaS for the Agile Enterprise

If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.

Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.

Learn More

Sponsored by ActiveState