Tarsnap: On-line Backups for the Truly Paranoid

Storing backups in the cloud requires a level of trust that not everyone is willing to give. While the convenience and low cost of automated, off-site backups is very compelling, the reality of putting personal data in the hands of complete strangers will never sit quite right with some people.

Enter Tarsnap—"on-line backups for the truly paranoid". Tarsnap is the brainchild of Dr Colin Percival, a former FreeBSD Security Officer. In 2006, he began research and development on a new solution for "encrypted, snapshotted remote backups", culminating in the release of Tarsnap in 2008.

Unlike other on-line backup solutions, Tarsnap uses an open, documented cryptographic design that securely encrypts your files. Rather than trusting a vendor's cryptographic claims, you have full access to the source code, which uses open-source libraries and industry-vetted protocols, such as RSA, AES and SHA.

Tarsnap provides a command-line client that operates very much like the traditional UNIX tar command. Familiar syntax, such as tarsnap cf and tarsnap xvf works as users would expect, except that instead of manipulating local tarballs, the client is working with cloud-based archives. These archives are stored on Amazon S3 with EC2 servers to handle client connections. To minimize bandwidth costs, Tarsnap uses smart rsync-like block-oriented snapshot operations that upload only data set changes and minimize transmission costs.

Although you could roll your own backup system by encrypting your data, running an rsync server off-site and so on, Tarsnap's scriptability, professional cryptographic design and utility pricing make it an attractive service for clueful users.

Open-Source Strong Crypto

Tarsnap compresses, encrypts and cryptographically signs every byte you send to it. No knowledge of cryptographic protocols is required, but if you are an interested enthusiast or feel more comfortable if you can look under the hood, you'll find the source code and full design thoroughly documented on http://tarsnap.com.

During installation, the user creates a keyfile, which optionally can be passphrase-protected. The file contains two 2048-bit RSA keys for signing and encryption. AES-256 session keys are created at runtime and encrypted with the RSA keys. The HMAC-SHA-256 hash function is used to verify data blocks and prevent tampering, and the same hash function is used elsewhere to verify communication between the client and server.

If those cryptographic terms make your eyes glaze over, the simple description is that Tarsnap encrypts and verifies your data in motion and at rest using industry-standard protocols. From a user perspective, the only thing that needs to be safeguarded is the keyfile. As you might expect, if it's lost, the keys to decrypt data stored with Tarsnap are unavailable and your data is no longer usable. After generation, this keyfile ideally should be stored securely off-site.

Efficient Design

Tarsnap backs up variable-length, deduplicated blocks of data. This means that if you change one file in a directory, only that file is backed up. Indeed, only the changes to that file are backed up if the file previously existed. If you move files around, Tarsnap is smart enough to recognize and skip previously backed-up blocks. Anytime you wish, you can take a backup of a directory, which creates a new archive. Behind the scenes, Tarsnap will retain whatever files are necessary to make that archive whole.

Suppose you have a directory with 100MB of files and a 10% daily change rate. If you create a Tarsnap archive every day of that directory, you will see the following consumption pattern:

  • Monday (initial): 100MB archive created (100MB uploaded, 100MB stored).

  • Tuesday: 10MB archive created (10MB new data uploaded, 110MB total data stored).

  • Wednesday: 10MB archive created (10MB new data uploaded, 120MB total data stored).

If on Thursday you delete the Monday archive, the Tuesday archive will be "made whole" by retaining whatever files from Monday are needed. This gives you tremendous flexibility to pick restore points. Of course, you can keep multiple archives with however many restore points you wish.

Your Current Account Balance Is $4.992238237884881224

Tarsnap works on a prepaid utility-metered model. Subscribers deposit a minimum of $5.00 and are charged only for the storage and bandwidth they consume. Although the cost is higher than plain Amazon S3 service, it reflects both the cryptographic, compression and deduplication value-add of Tarsnap. At the time of this writing, Tarsnap costs 30 cents per gigabyte-month for storage and 30 cents per gigabyte transmitted.

This cost may make Tarsnap infeasible for large, whole-server terabyte-size backups. However, it is ideal for critical, sensitive files that must be durable, available and safe in the event an attacker succeeded in compromising them. With no minimum charge or monthly fees, Tarsnap is very economical for small data sets or for data that compresses well. Some examples:

  • Backing up 100MB of files with 10% daily change rate for a month would cost only 30 cents.

  • A gigabyte that is backed up weekly with a 20% change rate would cost $1.40 a month.

Tarsnap bills based on attodollars (quintillionths of a dollar) to avoid profiting through rounding. This means your account balance is tracked to 18 decimal places. This is not just "pay by the drink" cloud pricing—it's practically "pay by the atom". Some users find that a small deposit lasts them months or years.

Important Flexibility

One of Tarsnap's best features is how easy it is to script. The ability to put a tarsnap cf command into a shell script makes use in cron jobs very straightforward, which encourages unattended, automated backups—the best kind.

Crucially, Tarsnap also supports a division of responsibilities. You can use the tarsnap-keymgmt tool to create keyfiles with limited authority. You may have one keyfile that lives on your server with permission to create archives, but not the authority to delete them. A master key with full privileges could be kept off-site, so that if attackers were to compromise your server, they would be unable to destroy your backups.

Using Tarsnap

To get started with Tarsnap, register at tarsnap.com, deposit some funds into your account, and download the client.

The client is available only as source, but the straightforward ./configure ; make install process is very easy. The client is supported on all major Linux distributions (as well as BSD-based systems). Take a quick peek at the download page to make sure you have the required operating system packages, as some of the development packages are not installed in typical Linux configurations.

If you are using a firewall, be aware that Tarsnap communicates via TCP on port 9279.

There are only two critical configuration items: the location of your keyfile and the location of your Tarsnap cache. Both are set in /usr/local/etc/tarsnap.conf. A tarsnap.conf.example is provided, and you probably can just copy the example as is. It defines your Tarsnap key as /root/tarsnap.key and your cache directory as /usr/local/tarsnap-cache, which will be created if it doesn't exist. The cachedir is a small state-tracking directory that lets Tarsnap keep track of backups.

Next, register your machine as follows. In this case, I'm setting up Tarsnap service for a machine called helicarrier. The e-mail address and password are the ones I used when I signed up for service with Tarsnap:


# tarsnap-keygen --keyfile /root/tarsnap.key 
 ↪--user andrew@fabbro.org --machine helicarrier
Enter tarsnap account password:
#

I have a directory I'd like to back up with Tarsnap:


# ls -l /docs
total 2092
-rw-rw---- 1 andrew 1833222 Jun 14 16:38 2011 Tax Return.pdf
-rw------- 1 andrew 48568 Jun 14 16:41 andrew_passwords.psafe3
-rw------- 1 tina   14271 Jun 14 16:42 tina_passwords.psafe3
-rw-rw-r-- 1 andrew 48128 Jun 14 16:41 vacation_hotels.doc
-rw-rw-r-- 1 andrew 46014 Jun 14 16:35 vacation_notes.doc
-rw-rw-r-- 1 andrew 134959 Jun 14 16:44 vacation_reservation.pdf

To back up, I just tell Tarsnap what name I want to call my archive ("docs.20120701" in this case) and which directory to back up. There's no requirement to use a date string in the archive name, but it makes versioning straightforward, as you'll see:


# tarsnap cf docs.20120701 /docs
tarsnap: Removing leading '/' from member names
                                 Total size  Compressed size
All archives                        2132325          1815898
  (unique data)                     2132325          1815898
This archive                        2132325          1815898
New data                            2132325          1815898

In my tarsnap.conf, I enabled the print-stats directive, which gives the account report shown. Note the compression, which reduces storage costs and improves cryptographic security. The "compressed size" of the "unique data" shows how much data is actually stored at Tarsnap, and you pay only for the compressed size.

The next day, I back up docs again to "docs.20120702". If I haven't made many changes, the backup will proceed very quickly and use little additional space:


# tarsnap cf docs.20120702 /docs
tarsnap: Removing leading '/' from member names
                                 Total size  Compressed size
All archives                        4264650          3631796
  (unique data)                     2132770          1816935
This archive                        2132325          1815898
New data                                445             1037

As you can see, although the amount of data for "all archives" has grown, the actual amount of "unique data" has barely increased. Tarsnap is smart enough to avoid backing up data that has not changed.

Now let's list the archives Tarsnap has stored:


# tarsnap --list-archives
docs.20120701
docs.20120702

To demonstrate Tarsnap's smart approach to storage further, I will delete the oldest backup:


# tarsnap df docs.20120701
                                 Total size  Compressed size
All archives                        2132325          1815898
  (unique data)                     2132325          1815898
This archive                        2132325          1815898
Deleted data                            445             1037

The "all archives" number has dropped because now I have only one archive, but the "unique data" has not changed much because it is still retaining all files necessary to satisfy my "docs.20120702" archive. If I list it, I can see my data is still there:


# tarsnap tvf docs.20120702
drwxrwxr-x  0 andrew 0 Jun 14 20:52 docs/
-rw-------  0 andrew 48568 Jun 14 16:41 docs/andrew_passwords.psafe3
-rw-rw-r--  0 andrew 46014 Jun 14 16:35 docs/vacation_notes.doc
-rw-rw-r--  0 andrew 134959 Jun 14 16:44 docs/vacation_reservation.pdf
-rw-rw-r--  0 andrew 48128 Jun 14 16:41 docs/vacation_hotels.doc
-rw-------  0 tina   14271 Jun 14 16:42 docs/tina_passwords.psafe3
-rw-rw----  0 andrew 1833222 Jun 14 16:38 docs/2011 Tax Return.pdf

I use a date string for convenient versioning, but I could just as easily use any naming convention for the archive, such as "docs.1", "docs.2" and so on. For my personal backups, I have a cron job that invokes Tarsnap nightly with a date-string-named archive:


tarsnap cf docs.`date '%+Y%m%d'` /docs

If I have a local calamity and want to restore that data, it is just another simple Tarsnap command to get my files back. Note that like traditional tar, Tarsnap removes the leading slash so all files are restored relative to the current working directory:


# cd /
# rm -rf docs
# tarsnap xvf docs.20120702
x docs/
x docs/andrew_passwords.psafe3
x docs/vacation_notes.doc
x docs/vacation_reservation.pdf
x docs/vacation_hotels.doc
x docs/tina_passwords.psafe3
x docs/2011 Tax Return.pdf
Tips

If you want to run Tarsnap as a nonroot user, create a .tarsnaprc file in your home directory. The syntax is identical to the tarsnap.conf discussed above. For example:


$ cat ~/tarsnap.conf
cachedir /home/andrew/tarsnap-cache
keyfile /home/andrew/tarsnap.key
print-stats

If you have other services or users contending for your Internet connection, use --maxbw-rate to specify a maximum bytes per second that Tarsnap will be allowed to use.

The print-stats command gives you account status information when used interactively, but for batch operations (such as running Tarsnap out of cron), you can suppress the output by removing that directive from your tarsnap.conf or by invoking Tarsnap with --no-print-stats.

Finally, you can play with the --dry-run and -v flags to simulate Tarsnap backup operations without actually burning network and disk. Once you've got your command constructed exactly as you want it, remove --dry-run.

License

Tarsnap is not distributed under an open-source license, although all client source is provided (and compiled by the user during install). However, the company regularly contributes back to projects whose code it utilizes, such as libarchive. Tarsnap also has open-sourced some of its own projects, including the scrypt package, the spiped secure pipe dæmon and the kivaloo NoSQL data store.

Further Information

The Tarsnap home page (http://tarsnap.com) has a wealth of documentation and information, as well as links to the Tarsnap IRC channel, mailing list and FAQ. The "Technical Details" section is absorbing reading for those interested in the deep details of Tarsnap's cryptographic approach and history.

Tarsnap also pays significant cash bounties for bugs found in the product, ranging from a few dollars for small cosmetic bugs up to a couple thousand dollars if someone finds a serious security flaw. This transparent approach is further comfort for the truly paranoid.

Tarsnap's current version is 1.0.32, released on February 22, 2012, for Linux, BSD, OS X, Solaris, Minix and Cygwin.

Q&A with Dr Colin Percival

Dr Colin Percival

A: I've heard "startup company" defined as "time to double revenue or the number of users is measured in months", and I've heard "highly successful startup company" defined as "time to double revenue or the number of users is measured in weeks". By those definitions, Tarsnap is a startup company, but not a highly successful one.

And yes, that is dodging the question, but Tarsnap is my primary source of income, and I come from a culture that considers someone's income to be a private matter, so I don't want to publish precise numbers.

Q. Looking at your Bug Bounties page, you've paid out more than $2,000 to users who've submitted bug reports. Why a "bug bounty" system as opposed to the traditional bug reporting in open-source projects?

A: I gave a talk about this at BSDCan'12 and AusCERT'12 (in consecutive weeks, no less—a word of advice, don't ever try to attend back-to-back conferences 10,000 miles apart). In short, bug bounties help get more people looking at code, and help encourage people to report anything they see that seems wrong.

Probably a better question is why I offered bounties for all bugs rather than just for security bugs—aside from Knuth's famous prizes, I don't know of any other case where bounties extend beyond security vulnerabilities. The answer here is roughly the same though—there's a lot of people who won't look for security bugs because they don't have a security background, but offering cash for all bugs gets them interested...and my experience with FreeBSD is that many security vulnerabilities aren't found by security people, but rather by other developers just saying "something here looks weird".

Q. What do you think are the weakest points of the Tarsnap design, from a security perspective? Is there anything that should keep the truly paranoid up at night?

The weakest link in Tarsnap's security, without question, is me. I wrote the Tarsnap client code to encrypt and sign everything on the client side, but how do you know that it does what I claim?

If you're paranoid, you should look at the code yourself and make sure it does what I claim it does. And when I release a new version, you should look at that too—compare against the previous version and make sure that the changes are sensible.

If the CIA kidnaps me and threatens to torture me until I decrypt someone's data, I won't be able to do anything to help them. But if the CIA kidnaps me and threatens to torture me unless I insert a Trojan horse into the next version of Tarsnap...well, I'm not optimistic about my ability to withstand torture.

Q. Tarsnap's client is very UNIX-nerd-friendly, with its familiar tar syntax. Do you have any interest or plans for a graphical interface for less-sophisticated users?

A: This is absolutely something I plan on doing...in the future. I have very little experience with anything GUI, so I'll probably end up paying someone to produce a GUI—ideally an open-source GUI for tar, since Tarsnap uses almost exactly the same command-line interface, and I think a good GUI front end to tar would be useful for its own sake too.

Q. Why attodollars and not femtodollars or zeptodollars?

A: With attodollars, I can express everything as a pair of 64-bit integers.

Q. Do you have any new features in the works for upcoming Tarsnap releases?

A: I'm mostly working on performance improvements these days. There's a few frequently requested features that I might add, but in general, there are good reasons when features don't exist—either they're impossible (for example, server-side expiration of old archives—the server has no way to know which blocks should be freed when an old archive is deleted) or would require a substantial redesign (for example, renaming archives—the client-server protocol has write transactions and delete transactions, but renaming would need to atomically write some files and delete others).

Backup image via Shutterstock.com.

Load Disqus comments