Data Deduplication with Linux
Returning to lessfs preparation, note that you can set the lessfs volume's options at mount time. For instance, you can set big_writes, max_read and max_write. The big_writes option improves throughput when the volume is used for backups, and both max_read and max_write must be defined to use it. The max_read and max_write options must always be equal to one another, and they define the block size lessfs uses: 4, 8, 16, 32, 64 or 128KB.
The block size can be used to tune the filesystem. For example, a larger block size, such as 128KB (131072 bytes), offers faster performance but at the cost of less deduplication (remember from earlier that lessfs uses block-level deduplication). All other options are generic FUSE options described in the FUSE documentation. An example of the supported mount options can be found in the lessfs man page:
$ man 1 lessfs
The following example is given to mount lessfs with a 128KB block size:
$ sudo lessfs /etc/lessfs.cfg /fuse -o negative_timeout=0,\
entry_timeout=0,attr_timeout=0,use_ino,\
readdir_ino,default_permissions,allow_other,big_writes,\
max_read=131072,max_write=131072
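Because max_read and max_write must match and must be one of the listed block sizes, it can help to derive both from a single value. The following is a minimal sketch of that idea, not part of lessfs: the function name lessfs_mount_opts is hypothetical, and the option string simply mirrors the example above. It prints the mount command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with lessfs): print a FUSE option string
# for a block size given in bytes, rejecting sizes the article does not list.
lessfs_mount_opts() {
    blksize=$1
    case "$blksize" in
        4096|8192|16384|32768|65536|131072) ;;
        *)
            echo "unsupported block size: $blksize" >&2
            return 1
            ;;
    esac
    # max_read and max_write must be equal, so both come from one variable.
    printf '%s' "negative_timeout=0,entry_timeout=0,attr_timeout=0,use_ino,"
    printf '%s' "readdir_ino,default_permissions,allow_other,big_writes,"
    printf '%s\n' "max_read=${blksize},max_write=${blksize}"
}

# Print the command rather than executing it; a real mount needs root
# privileges and an initialized lessfs configuration.
opts=$(lessfs_mount_opts 65536) && echo "lessfs /etc/lessfs.cfg /fuse -o $opts"
```

Building the string in one place avoids the easy mistake of editing max_read without also editing max_write.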
Additional configurable options for the database exist in your lessfs.cfg file (the same file you copied to the /etc directory earlier). The block size can be defined here, as can the method of additional compression applied to the deduplicated data, and more. Below is an excerpt of what the configuration file contains. To set a new value for an option, uncomment the desired line and comment out the alternatives:
BLKSIZE=131072
#BLKSIZE=65536
#BLKSIZE=32768
#BLKSIZE=16384
#BLKSIZE=4096
#COMPRESSION=none
COMPRESSION=qlz
#COMPRESSION=lzo
#COMPRESSION=bzip
#COMPRESSION=deflate
#COMPRESSION=disabled
This excerpt sets the default block size to 128KB and the default compression method to QuickLZ. If the defaults are not to your liking, in this file you also can define the commit-to-disk interval (default is 30 seconds) or a new path for your databases, but make sure to initialize the databases before use; otherwise, you'll get an error when you try to mount the lessfs filesystem.
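Since options are switched by commenting and uncommenting lines, it is easy to lose track of which value is actually in effect. The sketch below shows one way to check, assuming the simple KEY=value layout of the excerpt above. The helper name active_setting is hypothetical, and the script reads a temporary sample file rather than your real /etc/lessfs.cfg.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the first uncommented KEY=value
# line in a config file. Lines starting with '#' are skipped because the
# pattern is anchored to the key at the start of the line.
active_setting() {
    key=$1
    file=$2
    sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" | head -n 1
}

# Sample file mirroring part of the excerpt from the article.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
BLKSIZE=131072
#BLKSIZE=65536
#COMPRESSION=none
COMPRESSION=qlz
#COMPRESSION=lzo
EOF

echo "block size:  $(active_setting BLKSIZE "$cfg")"
echo "compression: $(active_setting COMPRESSION "$cfg")"
rm -f "$cfg"
```

To inspect a real installation, pass the path of your own configuration file as the second argument.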
Now, Linux is not limited to a single data deduplication solution. There also is SDFS, a file-level deduplication filesystem that also runs on the FUSE module. SDFS is a freely available cross-platform solution (Linux and Windows) made available by the Opendedup Project. On its official Web site, the project highlights the filesystem's scalability (it can dedup a petabyte or more of data); speed, performing deduplication/reduplication at a line speed of 290MB/s and higher; support for VMware while also mentioning its usage in Xen and KVM; flexibility in storage, as deduplicated data can be stored locally, on the network across multiple nodes (NFS/CIFS and iSCSI), or in the cloud; inline and batch mode deduplication (a method of post-process deduplication); and file and folder snapshot support. The project seems to be pushing itself as an enterprise-class solution, and with features like these, Opendedup means business.
It is also not surprising that since 2008, data deduplication has been a requested feature for Btrfs, the next-generation Linux filesystem, although that also may be in response to the development of data deduplication by Sun Microsystems (now Oracle) for its advanced ZFS filesystem. Unfortunately, at this point in time, it is unknown if and when Btrfs will introduce data deduplication support, although it already contains support for various types of data compression (such as zlib and LZO).
Currently, the lessfs2 release is under development, and it is supposed to introduce snapshot support, fast inode cloning, new databases apart from tokyocabinet (including hamsterdb and possibly BerkeleyDB), self-healing RAID (to repair corrupted chunks) and more.
As you can see, with a little time and effort, it is relatively simple to use data deduplication to reduce the total capacity consumed on a storage volume by removing redundant copies of data. I recommend its usage not only in server administration but also for personal use, primarily because with implementations such as lessfs, even if there isn't much redundant data, the additional data compression will help reduce the total size of a file when it is eventually written to disk. It is also worth mentioning that the lessfs-enabled volume does not need to remain local to the host system; it also can be exported across a network via NFS or even iSCSI and used by other devices within that same network, providing a more flexible solution.
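To get a rough feel for how block size affects block-level deduplication before committing to a setting, you can count how many fixed-size blocks of a file are unique. The following is an illustrative sketch only and makes no claim about how lessfs works internally. The helper names block_stats and make_block are hypothetical, and the script assumes the split, sha256sum, find and mktemp utilities found on a typical Linux system.

```shell
#!/bin/sh
# Hypothetical helper: print "UNIQUE TOTAL" block counts for FILE when it is
# cut into fixed-size blocks of BLKSIZE bytes.
block_stats() {
    file=$1
    blksize=$2
    tmpdir=$(mktemp -d) || return 1
    sums=$(mktemp) || { rm -rf "$tmpdir"; return 1; }
    # One chunk file per block; -a 6 leaves room for many chunks.
    split -a 6 -b "$blksize" "$file" "$tmpdir/blk." || {
        rm -rf "$tmpdir" "$sums"
        return 1
    }
    # One SHA-256 digest per block; identical blocks give identical digests.
    find "$tmpdir" -type f -exec sha256sum {} + | cut -d' ' -f1 > "$sums"
    total=$(wc -l < "$sums" | tr -d ' ')
    unique=$(sort -u "$sums" | wc -l | tr -d ' ')
    rm -rf "$tmpdir" "$sums"
    echo "$unique $total"
}

# Hypothetical helper: emit one 4096-byte block filled with a single character.
make_block() {
    dd if=/dev/zero bs=4096 count=1 2>/dev/null | tr '\0' "$1"
}

# Synthetic input: eight 4KB blocks drawn from only two distinct patterns.
sample=$(mktemp)
for c in a b a a b a b b; do make_block "$c"; done > "$sample"

for size in 4096 8192; do
    echo "block size $size (unique total): $(block_stats "$sample" "$size")"
done
rm -f "$sample"
```

With this synthetic input, 4KB blocks reduce eight blocks to two unique ones, while 8KB blocks leave all four blocks distinct, which illustrates why a larger block size tends to find fewer duplicates.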
Official Lessfs Project Web Site: http://www.lessfs.com
Lessfs SourceForge Project: http://sourceforge.net/projects/lessfs
Opendedup (SDFS) Project: http://www.opendedup.org
Wikipedia: Data Deduplication: http://en.wikipedia.org/wiki/Data_deduplication
Notes on the Integration of Lessfs into Fedora 15: http://fedoraproject.org/wiki/Features/LessFS
Lessfs with SCST How-To: http://www.lessfs.com/wordpress/?page_id=577
Petros Koutoupis is currently a senior software developer at Cleversafe, an IBM Company. He is also the creator and maintainer of the RapidDisk Project. Petros has worked in the data storage industry for more than a decade.