Data Deduplication with Linux
After resolving all the more obscure dependencies, you're ready to build and install the lessfs package. Download, build and install the package using the same configure, make and sudo make install commands from earlier.
Now you're ready to go, but before you can do anything, some preparation is needed. In the lessfs source directory, there is a subdirectory called etc/, and in it is a configuration file. Copy the configuration file to the system's /etc directory path:
$ sudo cp etc/lessfs.cfg /etc/
This file defines the location of the databases among a few other details (which I discuss later in this article, but for now let's concentrate on getting the filesystem up and running). You will need to create the directory path for the file data (default is /data/dta) and also for the metadata (default is /data/mta) for all file I/O operations sent to/from the lessfs filesystem. Create the directory paths:
$ sudo mkdir -p /data/{dta,mta}
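Before initializing the databases, it can be worth confirming that both directories exist and are writable. Here is a minimal sketch, assuming the default paths mentioned above; the helper name check_lessfs_dirs is hypothetical and not part of lessfs:

```shell
# check_lessfs_dirs DIR...: report any directory that is missing or not
# writable by the current user. Returns 0 only if every directory passes.
# (Hypothetical helper for illustration; lessfs does not ship it.)
check_lessfs_dirs() {
    status=0
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "missing: $dir" >&2
            status=1
        elif [ ! -w "$dir" ]; then
            echo "not writable: $dir" >&2
            status=1
        fi
    done
    return "$status"
}

# Example (run as the user that will run mklessfs):
#   check_lessfs_dirs /data/dta /data/mta && echo "ready for mklessfs"
```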
Initialize the databases in the directory paths with the mklessfs command:
$ sudo mklessfs -c /etc/lessfs.cfg
The -c option specifies the path and name of the configuration file. A man page does not exist for the command, but you still can display its built-in help with the -h option.
Now that the databases have been initialized, you're ready to mount a lessfs-enabled filesystem. In the following example, let's mount it to the /mnt path:
$ sudo lessfs /etc/lessfs.cfg /mnt
When mounted, the filesystem assumes the total capacity of the filesystem to which it is being mounted. In my case, it is the filesystem on /dev/sda1:
$ df -t fuse.lessfs
Filesystem 1K-blocks Used Available Use% Mounted on
lessfs 5871080 3031812 2541028 55% /mnt
$ df -t ext4
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 5871080 3031812 2541028 55% /
Currently, you should see nothing but a hidden .lessfs subdirectory when listing the contents of the newly mounted lessfs volume:
$ ls -a /mnt/
. .. .lessfs
Once mounted, the lessfs volume can be unmounted like any other volume:
$ sudo umount /mnt
Let's put the volume to the test. Writing file data to a lessfs volume is no different from what it would be to any other filesystem. In the example below, I'm using the dd command to write approximately 100MB of all zeros to /mnt/test.dat:
$ sudo dd if=/dev/zero of=/mnt/test.dat bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 5.05418 s, 20.7 MB/s
Because the filesystem is designed to eliminate redundant copies of data, and a file filled with nothing but zeros is a prime example of redundancy, only 48KB of capacity was consumed, and even that may be nothing more than the data needed to synchronize the databases:
$ df -t fuse.lessfs
Filesystem 1K-blocks Used Available Use% Mounted on
lessfs 5871080 3031860 2540980 55% /mnt
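The 48KB figure is simply the difference between the two "Used" values that df reported before and after the write. As a rough sketch, the following hypothetical helpers (used_kb and delta_kb are my own names, not lessfs tools) take such readings and subtract them:

```shell
# used_kb MOUNTPOINT: print the "Used" column that df reports, in 1K blocks.
# -P forces the one-line-per-filesystem POSIX format; -k forces 1K units.
used_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $3 }'
}

# delta_kb BEFORE AFTER: print the growth between two readings.
delta_kb() {
    echo $(( $2 - $1 ))
}

# Readings taken from the df listings before and after the dd run:
delta_kb 3031812 3031860    # prints 48
```

In practice, you would capture before=$(used_kb /mnt) ahead of the write and after=$(used_kb /mnt) once it completes, then pass both values to delta_kb.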
If you display a detailed listing of that same file in the lessfs-enabled directory, it appears that all 100MB have been written. Using its embedded logic, lessfs reconstructs the data on the fly whenever the file is subsequently read or written:
$ ls -l
total 102400
-rw-r--r-- 1 root root 104857600 2011-02-26 13:57 test.dat
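To get a feel for why a zero-filled file costs almost nothing, consider a rough sketch of block-level deduplication in general: split a file into fixed-size blocks, hash each block, and count how many distinct hashes remain. The 128KB block size, the SHA-256 hash and the function name count_blocks are all assumptions chosen for illustration; this does not reproduce the internals of lessfs:

```shell
# count_blocks FILE [BLOCK_SIZE_BYTES]: print "TOTAL UNIQUE", where TOTAL is
# the number of fixed-size blocks in FILE and UNIQUE is the number of
# distinct block hashes. Illustration only; not how lessfs is implemented.
count_blocks() {
    file=$1
    block_size=${2:-131072}            # assumed block size: 128KB
    tmpdir=$(mktemp -d) || return 1

    # Cut the file into blocks; -a 6 leaves room for many block files.
    split -a 6 -b "$block_size" "$file" "$tmpdir/blk."

    # Hash every block; identical blocks yield identical lines.
    for block in "$tmpdir"/blk.*; do
        [ -e "$block" ] || continue    # an empty file produces no blocks
        sha256sum < "$block"
    done > "$tmpdir/hashes"

    total=$(wc -l < "$tmpdir/hashes" | tr -d ' ')
    unique=$(sort -u "$tmpdir/hashes" | wc -l | tr -d ' ')
    rm -rf "$tmpdir"
    echo "$total $unique"
}

# Example: count_blocks /mnt/test.dat
```

For the 100MB zero-filled file above, this sketch counts 800 blocks of 128KB but only a single unique hash, which is the kind of redundancy a deduplicating filesystem stores once.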
Now, let's work with something a bit more complex: something containing a lot of random data. For this example, I decided to download the latest stable release candidate of the Linux kernel source from http://www.kernel.org, but before I did, I noted the capacity consumed on the lessfs volume as a reference point:
$ df -t fuse.lessfs
Filesystem 1K-blocks Used Available Use% Mounted on
lessfs 5871080 3031896 2540944 55% /mnt
$ sudo wget http://www.kernel.org/pub/linux/kernel/v2.6/testing/linux-2.6.38-rc6.tar.bz2
Listing the contents, you can see that the package is approximately 75MB:
$ ls -l linux-2.6.38-rc6.tar.bz2
-rw-r--r-- 1 root root 74783787 2011-02-21 19:50 linux-2.6.38-rc6.tar.bz2
Listing the capacity used to store the Linux kernel source archive yields a difference of roughly 75MB:
$ df -t fuse.lessfs
Filesystem 1K-blocks Used Available Use% Mounted on
lessfs 5871080 3106440 2466400 56% /mnt
Now, let's create a copy of the archived kernel source:
$ sudo cp linux-2.6.38-rc6.tar.bz2 linux-2.6.38-rc6.tar.bz2-bak
$ ls -l linux-2.6.38-rc6.tar.bz2*
-rw-r--r-- 1 root root 74783787 2011-02-21 19:50 linux-2.6.38-rc6.tar.bz2
-rw-r--r-- 1 root root 74783787 2011-02-26 14:43 linux-2.6.38-rc6.tar.bz2-bak
Storing a redundant copy of the same file consumes only an additional 44KB, not nearly as much as an additional 75MB:
$ df -t fuse.lessfs
Filesystem 1K-blocks Used Available Use% Mounted on
lessfs 5871080 3106484 2466356 56% /mnt
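One way to summarize these numbers is as a ratio of logical bytes written to physical bytes consumed. The sketch below is my own back-of-the-envelope calculation (dedup_ratio is a hypothetical helper, not a lessfs command), and it uses only the file sizes and df readings shown above:

```shell
# dedup_ratio LOGICAL_BYTES USED_BEFORE_KB USED_AFTER_KB
# Divide the bytes written by the growth in the df "Used" column.
dedup_ratio() {
    awk -v logical="$1" -v before="$2" -v after="$3" 'BEGIN {
        physical = (after - before) * 1024   # df reports 1K blocks
        if (physical <= 0) { print "n/a"; exit }
        printf "%.2f\n", logical / physical
    }'
}

# Two copies of the 74783787-byte archive; df readings from before the
# download (3031896) and after the copy (3106484):
dedup_ratio $((2 * 74783787)) 3031896 3106484    # prints 1.96
```

A ratio close to 2 is what you would expect here: two identical files were written, but the volume grew by roughly the size of one.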
And, because the databases hold the actual file data and metadata, nothing is lost if the system reboots, accidentally or intentionally, or if for whatever reason you need to unmount the filesystem. All you need to do is invoke the same mount command, and everything is restored:
$ sudo umount /mnt/
$ sudo lessfs /etc/lessfs.cfg /mnt
$ ls
linux-2.6.38-rc6.tar.bz2 linux-2.6.38-rc6.tar.bz2-bak
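A file listing shows only that the names came back. If you want more assurance that the contents survived the remount, you could record checksums beforehand and verify them afterward. This is a minimal sketch using GNU sha256sum; the helper names and the manifest path in the example are hypothetical:

```shell
# save_checksums DIR MANIFEST: record SHA-256 sums of the regular files
# directly under DIR. MANIFEST must be an absolute path outside DIR.
save_checksums() {
    ( cd "$1" && find . -maxdepth 1 -type f -exec sha256sum {} + ) > "$2"
}

# verify_checksums DIR MANIFEST: re-check the recorded sums.
# Returns non-zero if any file is missing or differs.
verify_checksums() {
    ( cd "$1" && sha256sum --quiet -c "$2" )
}

# Example:
#   save_checksums /mnt /tmp/lessfs.sha256
#   sudo umount /mnt && sudo lessfs /etc/lessfs.cfg /mnt
#   verify_checksums /mnt /tmp/lessfs.sha256 && echo "contents intact"
```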
As of version 1.0.4, lessfs also supports transactions, which eliminates the need for an fsck after a crash, such as an accidental reboot caused by power loss.
Petros Koutoupis is a full-time Linux kernel, device-driver and application developer for embedded and server platforms. He has been working in the data storage industry for more than six years and enjoys discussing the same technologies.
Comments
Enterprise, HSM type solutions?
Nice article. I work in an academic lab where we crunch massive amounts of data, and storage is always a huge headache for us. In the past we've had access to HSM storage management solutions, but the slowest tier has always been tape. It turns out that getting your data back from tape takes longer in some cases than just recomputing it, which already takes weeks on HPCs. It seems to me that if you could create an HSM-type solution with a fast parallel file system, like Lustre, as the fastest storage tier and a compressed, deduplicated file system on slower, cheaper magnetic disks, you might have a more reasonable, cost-effective storage system for HPC. (I have not run any numbers though, and I'm not sure whether you could build a system like this with OTS software/hardware.)
-Zaak
Not Linux, But take a look at SmartOS from Joyent
If you want to take advantage of de-duplication in your basement or development lab for your virtual machines you could consider using SmartOS as the underlying hypervisor platform. It comes with KVM as the hypervisor and ZFS as the filesystem. To enable de-dupe in ZFS it is simply: "zfs set dedup=on pool/filesystem", plus all the other awesome features of ZFS. Instant snapshots, clones, compression, etc. Then you can run your favorite GNU/Linux platform on top of it with de-duplication happening under the hypervisor. This ZFS de-duplication is all open-source and hails from the Illumos kernel.
Great post
This is a great post and I've often wondered how GNU/Linux gets support for deduplication at the filesystem level. Great stuff and just another example of open source at its best.