Data Deduplication with Linux
Shifting focus back to lessfs preparation, note that the lessfs volume's options can be defined by the user when mounting. For instance, you can define the desired options for big_write, max_read and max_write. The big_write improves throughput when used for backup purposes, and both max_read and max_write must be defined to use it. The max_read and max_write options always must be equal to one another and define the block size for lessfs to use: 4, 8, 16, 32, 64 and 128KB.
The definition of a block size can be used to tune the filesystem. For example, a larger block size, such as 128KB (131072), offers faster performance but, unfortunately, at the cost of less deduplication (remember from earlier that lessfs uses block-level deduplication). All other options are FUSE-generic options defined in the FUSE documentation. An example of the use of supported mount options can be found in the lessfs man page:
$ man 1 lessfs
The following example is given to mount lessfs with a 128KB block size:
$ sudo lessfs /etc/lessfs.cfg /fuse -o negative_timeout=0,\
entry_timeout=0,attr_timeout=0,use_ino,\
readdir_ino, default_permissions,allow_other,big_writes,\
max_read=131072,max_write=131072
Additional configurable options for the database exist in your lessfs.cfg file (the same file you copied over to the /etc directory path earlier). The block size can be defined here as well as even the method of additional data compression to use on the deduplicated data and more. Below is an excerpt of what the configuration file contains. In order to define a new value for various options clearly, just uncomment the option desired and, in turn, comment everything else:
BLKSIZE=131072
#BLKSIZE=65536
#BLKSIZE=32768
#BLKSIZE=16384
#BLKSIZE=4096
#COMPRESSION=none
COMPRESSION=qlz
#COMPRESSION=lzo
#COMPRESSION=bzip
#COMPRESSION=deflate
#COMPRESSION=disabled
This excerpt defines the default block size to 128KB and the default compression method to QuickLZ. If the defaults are not to your liking, in this file you also can define the commit to disk intervals (default is 30 seconds) or a new path for your databases, but make sure to initialize the databases before use; otherwise, you'll get an error when you try to mount the lessfs filesystem.
Summary
Now, Linux is not limited to a single data deduplication solution. There also is SDFS, a file-level deduplication filesystem that also runs on the FUSE module. SDFS is a freely available cross-platform solution (Linux and Windows) made available by the Opendedup Project. On its official Web site, the project highlights the filesystem's scalability (it can dedup a petabyte or more of data); speed, performing deduplication/reduplication at a line speed of 290MB/s and higher; support for VMware while also mentioning its usage in Xen and KVM; flexibility in storage, as deduplicated data can be stored locally, on the network across multiple nodes (NFS/CIFS and iSCSI), or in the cloud; inline and batch mode deduplication (a method of post-process deduplication); and file and folder snapshot support. The project seems to be pushing itself as an enterprise-class solution, and with features like these, Opendedup means business.
It is also not surprising that since 2008, data deduplication has been a requested feature for Btrfs, the next-generation Linux filesystem. Although that also may be in response to Sun Microsystem's (now Oracle's) development of data deduplication into its advanced ZFS filesystem. Unfortunately, at this point in time, it is unknown if and when Btrfs will introduce data deduplication support, although it already contains support for various types of data compression (such as zlib and LZO).
Currently, the lessfs2 release is under development, and it is supposed to introduce snapshot support, fast inode cloning, new databases (including hamsterdb and possibly BerkeleyDB) apart from tokyocabinet, self-healing RAID (to repair corrupted chunks) and more.
As you can see, with a little time and effort, it is relatively simple to utilize the recent trend of data deduplication to reduce the total capacity consumed on a storage volume by removing all redundant copies of data. I recommend its usage in not only server administration but even for personal use, primarily because with implementations such as lessfs, even if there isn't too much redundant data, the additional data compression will help reduce the total size of the file when it is eventually written to disk. It is also worth mentioning that the lessfs-enabled volume does not need to remain local to the host system, but it also can be exported across a network via NFS to even iSCSI and utilized by other devices within that same network, providing a more flexible solution.
Resources
Official Lessfs Project Web Site: http://www.lessfs.com
Lessfs SourceForge Project: http://sourceforge.net/projects/lessfs
Opendedup (SDFS) Project: http://www.opendedup.org
Wikipedia: Data Deduplication: http://en.wikipedia.org/wiki/Data_deduplication
Notes on the Integration of Lessfs into Fedora 15: http://fedoraproject.org/wiki/Features/LessFS
Lessfs with SCST How-To: http://www.lessfs.com/wordpress/?page_id=577
- « first
- ‹ previous
- 1
- 2
- 3
Petros Koutoupis is a full-time Linux kernel, device-driver and application developer for embedded and server platforms. He has been working in the data storage industry for more than six years and enjoys discussing the same technologies.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.
Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.
Sponsored by ActiveState
| Non-Linux FOSS: libnotify, OS X Style | Jun 18, 2013 |
| Containers—Not Virtual Machines—Are the Future Cloud | Jun 17, 2013 |
| Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer | Jun 12, 2013 |
| Weechat, Irssi's Little Brother | Jun 11, 2013 |
| One Tail Just Isn't Enough | Jun 07, 2013 |
| Introduction to MapReduce with Hadoop on Linux | Jun 05, 2013 |
- Containers—Not Virtual Machines—Are the Future Cloud
- Non-Linux FOSS: libnotify, OS X Style
- Linux Systems Administrator
- Validate an E-Mail Address with PHP, the Right Way
- Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer
- Senior Perl Developer
- Technical Support Rep
- UX Designer
- Introduction to MapReduce with Hadoop on Linux
- RSS Feeds
- One advantage with VMs
27 min 48 sec ago - about info
1 hour 57 sec ago - info
1 hour 1 min ago - info
1 hour 2 min ago - info
1 hour 4 min ago - info
1 hour 5 min ago - abut info
1 hour 7 min ago - info
1 hour 8 min ago - info
1 hour 10 min ago - info
1 hour 11 min ago





Comments
Enterprise, HSM type solutions?
Nice article. I work in an academic lab where we crunch massive amounts of data, and storage is always a huge headache for us. In the past we've had access to HSM storage management solutions, but the slowest tier has always been tape. It turns out that getting your data back from tape takes longer in some cases than just recomputing it, which already takes weeks on HPCs. It seems to me that if you could create HSM type solution with a fast parallel file system, like lustre, as the fastest storage tier and a compressed, deduplicated file system on slower, cheaper magnetic disks you might have a more reasonable, cost effecctive storage system for HPC. (I have not run any numbers though, an I'm not sure wahether yoou could build a system like this with OTS software/hardware.)
-Zaak
Not Linux, But take a look at SmartOS from Joyent
If you want to take advantage of de-duplication in your basement or development lab for your virtual machines you could consider using SmartOS as the underlying hypervisor platform. It comes with KVM as the hypervisor and ZFS as the filesystem. To enable de-dupe in ZFS it is simply: "zfs set dedup=on pool/filesystem", plus all the other awesome features of ZFS. Instant snapshots, clones, compression, etc. Then you can run your favorite GNU/Linux platform on top of it with de-duplication happening under the hypervisor. This ZFS de-duplication is all open-source and hails from the Illumos kernel.
Great post
This is a great post and I've often wondered how GNU/Linux gets support for deduplication at the filesystem level. Great stuff and just another example of open source at its best.
Great post
This is a great post and I've often wondered how GNU/Linux gets support for deduplication at the filesystem level. Great stuff and just another example of open source at its best.