Dedupe to tape: Are you crazy?

W. Curtis Preston at Backup Central has posted an interesting entry, “Is Dedupe to tape crazy?” Even he admits in the first sentence that, yes, dedupe to tape is crazy. But then he bumps the crazy-fest up a notch by asking whether it’s crazy-bad or crazy-good.

You should read the article, but let me jump ahead to the end. He says it’s good, at least in certain cases. I say it’s bad, in any case where you’d like to actually retrieve your data, rather than minimize your costs.

Here’s why…

To understand that this is a bad idea, you’ve got to know what deduplication is first. As Wikipedia puts it succinctly, deduplication is the elimination of redundant data. The data is stored in one place, and all references to that data are stored as a shorter index number which points to the deduplicated data. Think of it like symlinks in your filesystem, if you’d like, except the symlinks are block level.
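If it helps to see the idea in code, here is a minimal sketch of block-level deduplication. It assumes fixed-size blocks and SHA-256 digests as the "index numbers"; the names `dedupe`, `restore`, and `BLOCK_SIZE` are made up for illustration and do not come from any particular product.

```python
import hashlib

BLOCK_SIZE = 4096  # assumption: fixed-size blocks, chosen only for illustration


def dedupe(data, store):
    """Split data into blocks, keep each unique block once, return the references."""
    refs = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # only the first copy of a block is stored
        refs.append(digest)              # every later copy is just a pointer
    return refs


def restore(refs, store):
    """Rebuild the original data by following each reference into the store."""
    return b"".join(store[ref] for ref in refs)


store = {}
payload = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
first = dedupe(payload, store)   # first "backup"
second = dedupe(payload, store)  # second "backup" of the same data adds no new blocks
```

After both runs the store holds two unique blocks, while each backup is nothing more than a list of four references into it. Note that `restore` fails outright if the store is gone, which is the point of the rest of this post.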

When you’re storing things on disk, this leads to near-miraculous disk savings. Want to store 5 years worth of full backups, but only use the disk space of the equivalent incremental backups? No problem! Store 50 copies of the same directory tree in various differing hierarchies, but only use the disk space of one? Done!

Now, what happens when the time comes to back up those 50 different hierarchies? Well, in the time-honored, tape-expensive version, you use the tape equivalent of 50 copies of the data.

What Mr Preston is suggesting is that for long term storage, instead of storing 50 copies of the same data, store the actual data once, and back up the pointers to the data on the various backup sets. The argument for this is that if a full backup of the data takes 10 tapes, rather than 50 * 10 tapes, you can do 50 * 1, where the 50 different backup sets look at the 1 tape for the deduplicated data. This is a massive cost savings by any measure.

My problem with this is tape failure. If one of the 50 individual backup tapes fails, it’s no problem. Sure, you lose that particular arrangement of the data, but it’s not that big of an issue. Unfortunate, sure, but not tragic. If you lose the 1 tape that contains the deduplicated data, though, then you immediately have a Bad Day(tm).

Essentially, you are betting on one tape not failing over the course of (in the argument of Mr Preston) 7+ years. And if something does happen in that 7 years, whether it’s degaussing, loss, theft, fire, water, or aliens, you don’t lose one backup set. You lose every backup that referenced that set of data.
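To make the difference in exposure concrete, here is a rough sketch that models each backup set as the set of tapes it needs for a restore. The tape names and the one-tape-per-set layout are simplifying assumptions of mine, not anything from Mr Preston's article.

```python
def unrestorable(backup_sets, lost_tape):
    """Return the names of the backup sets that need the lost tape to restore."""
    return sorted(name for name, tapes in backup_sets.items() if lost_tape in tapes)


# Traditional layout (simplified): each of the 50 backup sets lives on its own tape.
traditional = {f"set{i:02d}": {f"full{i:02d}"} for i in range(50)}

# Deduplicated layout (simplified): each set has its own pointer tape,
# but all 50 sets depend on the same shared data tape.
deduped = {f"set{i:02d}": {f"ptr{i:02d}", "shared-data"} for i in range(50)}

print(len(unrestorable(traditional, "full07")))   # 1
print(len(unrestorable(deduped, "ptr07")))        # 1
print(len(unrestorable(deduped, "shared-data")))  # 50
```

Losing any single tape in the traditional layout, or any pointer tape in the deduplicated one, costs you one backup set. Losing the shared tape costs you all of them.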

So I would, if I could afford one, buy a deduplicated storage array in a heartbeat for my backup needs. But I would not trust a deduplicated archival system at all. The odds of loss are too great, and it’s not worth the savings. I’d rather cut the frequency of my backups than save money by making my archives co-dependent.

But I could be wrong. Feel free to comment and let me know if I am.

It should probably be noted that Preston wrote about this too. The difference is, of course, that he knows what he’s talking about… :-)

  • I completely agree with you. Nothing more to say.


  • We are using an rsync-based backup-to-disk tool. It uses rsync’s --link-dest feature for file-based de-duplication between runs. This saves space, because after the initial run all backup runs are incremental. But you have a full backup for every run, as unchanged files are just hardlinks to their already existing copy on the backup volume.

    And we do at least bi-weekly tape backups of the current snapshot to store them off-site.

    There are other tools (rsnapshot, rdiff-backup) with a similar approach.

    This may or may not work in your environment, especially the original data should be rsync-friendly (e.g. a directory tree with only some not so big files changing, i.e. not a database).
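The hardlink approach described in the comment above can be sketched roughly as follows. This is not how rsync or any of the named tools are implemented; it is a simplified illustration that compares file contents directly, and the function name `snapshot` and the directory names are hypothetical.

```python
import filecmp
import os
import shutil
import tempfile


def snapshot(source, previous, target):
    """Copy source into target, hardlinking files unchanged since the previous run."""
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        os.makedirs(os.path.normpath(os.path.join(target, rel)), exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            new = os.path.join(target, rel, name)
            old = os.path.join(previous, rel, name) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, new)       # unchanged: share the copy already on disk
            else:
                shutil.copy2(src, new)  # new or changed: store a fresh copy


with tempfile.TemporaryDirectory() as tmp:
    data = os.path.join(tmp, "data")
    os.makedirs(data)
    with open(os.path.join(data, "a.txt"), "w") as fh:
        fh.write("unchanged")
    run1 = os.path.join(tmp, "run1")
    run2 = os.path.join(tmp, "run2")
    snapshot(data, None, run1)  # initial full run
    snapshot(data, run1, run2)  # second run links to the first
    same_inode = os.path.samefile(os.path.join(run1, "a.txt"),
                                  os.path.join(run2, "a.txt"))
```

Each run directory looks like a full backup, but an unchanged file occupies disk space only once. The trade-off is the same one the post worries about: every run that links to a file depends on that single stored copy.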

  • I see dedupe to tape being about as irritating and slow as file level recovery from block level backups.

    File level recovery from block level backups requires an interim recovery of required blocks from media into a cache, with files/data then reconstructed out of that cache. The more heavily fragmented the files originally at backup time, the slower and more painful this process is (or the significantly larger the cache space required to minimise media passes!)

    I don’t see how recovery from dedupe tape would be any different from this.

  • James

    The old backup system: Full backup Saturday night, differentials the rest of the week. This requires us to have both the full backup tape from Saturday night and the nightly differential in order to perform a restoration.

    The new backup system: Full backup Saturday night, dedupe backups the rest of the week. This requires us to have both the full backup tape from Saturday night and the nightly dedupe in order to perform a restoration.

    The more things change…

    Frankly, this deduping sounds like a defensive measure implemented by a tech in a fight with a Dilbert style PHB who has declared that incrementals/differentials are bad since they are not full backups. The PHB doesn’t care that full backups take longer than 24 hours, he just declares that the tech must get everything working. The tech knows that he needs to do full backup on Saturday, then incrementals or differentials during the week. So the tech creates this new name ‘deduping’. The PHB is happy and goes back to his golf game, and the data is safe under the same general procedures that have been working fine for thirty or so years in the computer industry.

  • UX-admin

    “If you lose the 1 tape that contains the deduplicated data, though, then you immediately have a Bad Day(tm).”

    That’s why you do something called “cloning” at the end of the month, where all your full backup sets are consolidated onto a clone. “Deduped” clone in this case, of course.

    For example, Legato NetWorker has the capability to produce clones (and automatically, too). I don’t know if it has the capability to deduplicate, but the groundwork is there. Other backup software likely has some sort of cloning capability too, and if it doesn’t, it should. It’s one’s ‘insurance policy’ and consolidation after all, and that’s not to be taken lightly.
