Backups Suck

Date February 6, 2009

Many thanks to Michael Janke for this blog entry

Years ago we had a period of time where we had nothing but problems with backups. Tape drives failed, changers failed, jobs failed. We rarely if ever went a weekend without getting called in to tweak, repair or restart backup hardware or software. It was a pain. One of the many times that the hardware vendor was on site replacing drives or changer parts, I asked the tech:

"Does everyone hate backups as much as I do?"

The answer:

"Yep."

So backups suck, but like it or not, they are an essential part of system administration, and perhaps the most essential part. If you can't recover from failure, then you are not a system manager or system administrator. Change your title. The one you've been using isn't appropriate.

Here are my thoughts on backups.
Key concepts:

RPO and RTO. If you don't know what they mean, start googling. If you know what they mean, then you should also know what they are for each of your applications. If you have no formal SLAs covering recovery, you should at least have informal agreements between you and your managers and your customers as to what expectations are for recovery points and recovery times for storage, server and site failures. If you don't know what the expected RPO and RTO are for your applications, you've got a problem. You can't really make backup and recovery decisions without at least some idea of what they might be. At the very least, make up an RPO and RTO and let your boss know what they are. A little CYA doesn't hurt.
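
To make the two numbers concrete, here is a minimal Python sketch. The function names, the 24-hour RPO and the 4-hour RTO are assumptions invented for illustration, not values from any real SLA. It checks whether the age of the last good backup is within the recovery point objective, and whether a measured restore time is within the recovery time objective.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if a failure right now would lose no more data than the RPO allows."""
    return now - last_backup <= rpo


def meets_rto(measured_restore: timedelta, rto: timedelta) -> bool:
    """True if the restore time measured in a test fits inside the RTO."""
    return measured_restore <= rto


# Hypothetical example: last night's backup finished at 23:00, it is now 09:00.
now = datetime(2009, 2, 6, 9, 0)
last_backup = datetime(2009, 2, 5, 23, 0)

print(meets_rpo(last_backup, now, rpo=timedelta(hours=24)))  # True: 10 hours of exposure
print(meets_rpo(last_backup, now, rpo=timedelta(hours=4)))   # False: nightly backups are not enough
print(meets_rto(timedelta(hours=6), rto=timedelta(hours=4)))  # False: the restore is too slow
```

The measured restore time has to come from an actual test restore; an estimate tells you very little.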

Backup versus Archive. You can Google that phrase ('backup versus archive') and get good definitions. The way I define them, a backup exists to permit recovery from system failure to the most recent recoverable point in time in a manner that meets recovery point and recovery time objectives. An archive exists to permit recovery to points in time other than the most recent recoverable point in time. By those definitions, any backup older than the most recent backup is an archive. In general, backups protect you from physical failures. Archives protect you from logical failures.
Backups

You have a valid backup when:

1. The backup is on separate spindles and controllers from the source data.
2. The backup is off site.
3. The backup is tested by successfully restoring data.

If it isn't on separate controllers and spindles then it's not a backup. It might be a copy of the data that protects you against certain failure modes, but it's not a backup. RAID 1, 5, 6, 10, 0+1, 1+0, or whatever are not substitutes for backups. Controllers fail. I've personally experienced a handful of controller failures that resulted in scrambled data. The failed controller will scramble both halves of the mirror and all of the RAID set. So a database dump to a LUN on the same SAN as the database isn't a backup until it is swept off to tape or copied to a disk pool on some other controller/spindles.

If it isn't off site, it's not a backup. If you have a stack of tapes that will get sent off to the data vault when the company that does that shows up at 10am Monday, those tapes will be a valid backup at 10:05am Monday. Until then, they are a copy of your data, not a valid backup.

If you haven't tested it, it's not a backup. Think about that. One of the things I've done to drive home the importance of backups is to walk up to a sysadmin's cube and ask them to delete their home directory. I'm the boss, I can do that. Trust me, it's fun. :-) If they hesitate, I know right away that they don't have confidence in their backups. That's bad – for them, for me, and for our customers.
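
One way to turn "tested by successfully restoring data" into something repeatable is to restore to a scratch location and compare checksums against the source. The following is a rough Python sketch of that idea, not a replacement for your backup software's own verification. The names `tree_digests` and `verify_restore` and the sample files are hypothetical, and it assumes the source has not changed since the backup was taken.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to the SHA-256 of its contents."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # read_bytes() loads the whole file; fine for a sketch, not for huge files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            digests[path.relative_to(root).as_posix()] = digest
    return digests


def verify_restore(source: Path, restored: Path) -> list[str]:
    """Return relative paths that are missing from, or differ in, the restored tree."""
    src = tree_digests(source)
    dst = tree_digests(restored)
    return sorted(name for name, digest in src.items() if dst.get(name) != digest)


# Demo with throwaway directories standing in for a home directory and its restore.
with tempfile.TemporaryDirectory() as tmp:
    source, restored = Path(tmp, "home"), Path(tmp, "restored")
    source.mkdir()
    restored.mkdir()
    (source / "notes.txt").write_text("keep me")
    (source / "todo.txt").write_text("test the backups")
    (restored / "notes.txt").write_text("keep me")  # todo.txt never made it back
    print(verify_restore(source, restored))  # ['todo.txt']
```

An empty list means every source file came back byte for byte. Anything else is a file you would have lost.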
Archives

A backup is not an archive, and an archive is not a backup. In my world, an archive permits recovery to points in time other than the most recent recoverable point in time. Perhaps because of a regulatory requirement, you need to be able to recover files or databases as they were a month, a year or a decade ago. Then you need archives. If you don't have regulatory or other retention requirements, an archive still protects you against 'logical failures'. For example an archive provides protection against file deletion or corruption that went undetected for a period of time, or protection against accidental or intentional deletion or destruction of data.

But you likely can design a system where backups and archives use the same hardware and software, and in many cases, a backup can become an archive. In the common Grandfather-Father-Son (GFS) tape rotation, the full backups become archives as soon as the next full backup is finished. At that point in time, the full backup is no longer protecting you against server, storage or site failure. It's too old for that. But it is still protecting you against logical failure (a file or database that got corrupted or deleted, but went undetected for a period of time.)
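
The rule that a full backup turns into an archive once the next full finishes can be written down in a few lines. This Python sketch uses the definitions from this post; the function name `classify` and the weekly dates are made up for the example.

```python
from datetime import date


def classify(full_backup_dates: list[date]) -> dict[date, str]:
    """Label the newest completed full as the backup and every older one as an archive."""
    if not full_backup_dates:
        return {}
    newest = max(full_backup_dates)
    return {
        d: ("backup" if d == newest else "archive")
        for d in sorted(full_backup_dates)
    }


# Hypothetical weekly fulls taken on three consecutive Saturdays.
fulls = [date(2009, 1, 24), date(2009, 1, 31), date(2009, 2, 7)]
for day, role in classify(fulls).items():
    print(day, role)
# 2009-01-24 archive
# 2009-01-31 archive
# 2009-02-07 backup
```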
Snapshots, Replication and Log Shipping.

Vendors are more than happy to sell us tools and toys that solve all our problems, but do they really? When do snapshots and various replication strategies protect us against physical and logical failures?

It depends.

We replicate some data (actually 25 million files) to an off site location using an OEM'd version of Doubletake. The target of the replication is a fully configured cluster on separate SAN controllers miles away from the source. That copy of the data protects us against site, storage and server failure (it's our backup). But when a customer hits the 'press here to delete a half-million files' button that the software vendor so graciously provided (logical failure), the deletes get replicated in a couple of seconds. The off site replica doesn't help. Those files are recovered from an archive (last night's incremental + last weekend's full), not from the backup (the real time replica).

Another example is the classic case where a user or DBA deletes a large number of rows from a table or does the old 'DROP TABLE' trick. If you've configured log shipping or some other database replication tool to protect yourself against site, server or storage failure, you'll replicate the logical failure (the deletion or drop) faster than you can blink, and your replicas will also be toasted. The replication technology will replicate the good, the bad and the ugly. It doesn't know the difference. You need to be able to perform a point-in-time recovery to a point before the damage was done, and replication alone doesn't provide that. Transaction logs, archive logs and similar technologies provide the point-in-time recovery.
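
A toy model shows the difference. In the Python sketch below, the log format, the `replay` function and the sample rows are all invented for illustration and do not describe how any particular database ships or replays logs. A replica that applies every log entry ends up as empty as the primary, while replaying the same log only up to the moment before the drop gets the data back.

```python
def replay(log, until=None):
    """Apply (timestamp, op, key, value) entries in order, stopping before `until`.

    The log is assumed to be sorted by timestamp.
    """
    table = {}
    for ts, op, key, value in log:
        if until is not None and ts >= until:
            break
        if op == "put":
            table[key] = value
        elif op == "delete":
            table.pop(key, None)
        elif op == "drop":
            table.clear()
    return table


log = [
    (1, "put", "alice", 100),
    (2, "put", "bob", 250),
    (3, "drop", None, None),  # the logical failure
]

replica = replay(log)              # replication faithfully applies the drop too
recovered = replay(log, until=3)   # point-in-time recovery stops just before it

print(replica)    # {}
print(recovered)  # {'alice': 100, 'bob': 250}
```

The replica and the recovered copy read the same log. The only difference is being able to choose where to stop.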

Snapshots tend to complement replication. In general, a snapshot of a disk that is stored on the same controllers protects you against logical failure (it's an archive), but not against site, server or storage failure (it's not a backup). The snap gives you a point in time that is recoverable against logical failure, but not physical failure.
Conclusion

Whatever you have for a backup and archive system, keep in mind

* physical and logical failure
* recovery point and recovery time

And make sure you understand how you will recover from the failure modes within the recovery time to the recovery point.

Then – because I teach at a local college, I get to give you all an assignment. It's got two parts:

1. Delete your home directory
2. Recover it from backup

Let me know how you did.

Michael Janke Last In, First Out

  • Anonymous

    I have a counterpart on another site, who says they have never backed up their virtual machines as they use raid arrays.

    when i heard this i think i was too stunned to actually say anything.
    i did try to point out that the redundancy in raid will not stop a file being deleted or corrupted, but it was like talking Swahili to a goat (i.e. a waste of breath)

    there will always be people who think their hardware is infalable (just like my spelling ;p) and will never change their mind until it bites them square in the arse.

    i personally prefer the paranoid backup policy, 2 is 1, 1 is none. when they work, my weekly backups run twice, one tape then goes off site, the other into the fire safe. We sort of chance it for the nightly differentials, they stay in the tape robot.

  • Will

    Awesome!

    I started out in this industry in tech support for a backup software company, so I like to think I'm aware of the importance of them, and that I'm aware of the weakness of the various components used for redundancy and backup. There is an art to imagining the modes of failure in a complex system, and how to protect against them all.

    Fortunately I've never had responsibility for a system failure that didn't have a good backup on tape or disk nearby. But I've come close...

    RAID is comforting but obviously not infallible. I've witnessed a drive failure during the rebuild of a RAID5 array after replacing the first failed drive. Poof. Data unusable.

    I recovered nicely from a backup that was stored on a network shared disk. But it was one day out of date as the most recent backup didn't complete successfully! Fortunately the users had cached email clients, and their user docs were also synced to their desktop. The only thing lost were some new pictures stored in a shared folder that wasn't synced to a desktop.

    Back when I was in tech support I did witness an incident where a backup was reported successful, but the tape was for some reason not readable by the backup and recovery software.

    And even if your RAID array is intact, if the server motherboard goes code black then where are you going to put the drives? You'll probably have to find another server that has the same controller...if you can replace the motherboard, it probably needs to be the same with the same firmware. Maybe it's doable but it's not going to be quick...

    The problem for small business and home users is that backup and recovery are complex, expensive, and error prone. They need constant babysitting, updating, verifying.

    After racking my brain and analyzing various alternatives, I came to a system for several side businesses that I work for that uses the Windows NTBackup utility to save the backup file to a networked drive each night--full backups. And it keeps at least three copies online.

    This eliminates the crappy optical disk writers, expensive robots and tapes, and requires little manual intervention. If the network is sound, the destination network drive is in a stable pc, it is pretty much blow and go.

    Now what happens if the building burns down? I've employed Mozy.com to back up to their internet data center. I think they charge a nominal monthly fee of 7 or 8 dollars, plus 50 cents per gigabyte. Their software client is very configurable, easy to use, throttles performance and bandwidth, and it encrypts the data before shipping it out. Nightly reports can be sent out and it backs up Exchange, AD, and SQLServer. Restores can be done in various ways--individual files are cataloged and can be restored from Explorer. Bigger restores can be downloaded anywhere on the internet, and a full server can be shipped out via FedEx on DVD's. I don't mean to sound like an advertisement--I like Mozy and I haven't really analyzed other internet backup sites--I know there are more and more out there every year and the prices are getting very reasonable.

  • Michael Janke

    Thanks for the comments.

    There have been a couple of dot-coms that already are or will be dot-bombs simply because they didn't have backups when they needed them. (The Magnolia bookmark site being the latest). Imagine a small group of dedicated dot-com'ers spending a couple years of 16 hour days building a cool service, thinking that they'd eventually sell out & get the big payout.

    Then they throw it all away on something stupid like backups.

    I've been on a 'backup rant' lately. ;)

  • chewyfruitloop

    Backup needs to be a sysadmin religion

    If you haven't got it you need saving brother (sisters too I guess, but let's face it, most of us aren't ladies)!!

  • Anonymous

    All these warnings and considerations are great, as someone winging it without a lot of admin training, I appreciate the insights. But for a small operation, with a very small budget, is there an overkill point? If I have a Raid1 for the server and a data/document copy nightly to independent disk, isn't fooling with finicky tapes a time waster? I mean this is a SMALL operation, and if a fire would kill the business anyway, is offsiting data a waste? And as I would love to test a backup, can you direct me to more info on how to do that? Not trying to get anyone up in arms, I'm asking an honest question.

  • Matt

    @Anonymous

    I should probably let Michael respond, since he wrote the original piece, but I'll throw my $0.02 in..

    There is an overkill point. You hit that whenever the cost of the backup solution exceeds the value of the data.

    I've got a small operation, depending on how you look at it. There are less than 20 people in my company. The data is literally worth billions of dollars, though.

    Ask yourself what would happen if you lost the data. If you don't care if you lose the data, then you don't need backups. If losing the data would be a minor inconvenience to the company, then store it on another device. However, if the company's existence depends on the data, you need to step it up a notch.

    Also, backups really fall into two categories: backups and archives.

    Backups ensure that you can recover your data in the event that the data store is incapacitated. This could be the machine crashing, the site catching on fire, or whatever.

    Archives make sure that you can recover data from a previous point in time. Suppose you found out that your data has been corrupted for some period of time. Archives let you go back and recover from a time before the corruption.

    Depending on the importance of the data, you should decide how extensive and frequent your backup / archive solution needs to be.

    It should also be reiterated that RAID isn't a backup solution, it's outage prevention. With a single drive, if the drive dies, you lose the ability to access the data. With RAID 1, you can continue to access the data, even with the degraded RAID array.

    Does any of that make sense? Feel free to email me at [email protected] if you don't want to continue in the thread.

  • Michael Janke

    If I have a Raid1 for the server and a data/document copy nightly to independent disk,

    I'd say that in some cases that is a valid backup. And if you keep a few of those copies on independent disks for a few days or weeks, then you've got a sort of an archive too.

    If your recovery process would be to install a new server, install the apps from original CD's and copy the application data from the backup disk to the data directory of the applications, then you probably have a recovery process also. That doesn't sound too hard to test to me.

    As far as off site goes, it may be true that a fire would kill the company, but how about a sprinkler accidentally going off? Or a construction worker sawing off a live water pipe accidentally? Or a burglary? A storm that lets water in the building and floods the server room? A simple roof leak above the server room? A lightning strike that smokes anything that is plugged in? (I've personally seen all of these in our organization at one time or another.)

    There are a thousand things that can and do happen that kill hardware and data, not just fires.

    I'll bet that you can set something up with ibackup.com or a similar service that would let you off-site your data pretty cheaply.

  • Anonymous

    Thank you Matt and Michael for your responses, these are great insights. Very helpful for me to rethink my approach to these topics.

  • Matt

    @anonymous

    Always happy to help, and thanks for reading and commenting!

  • Jeff

    "2. The backup is off site."

    I'd like to amend this to read, "the backup is off-site and off-line."

    To make my case, I will cite the following cases (Webhostingtalk and AVsim) where malicious users deleted data from an offsite, online backup server before trashing the main copies of the data:

    http://www.webmasterworld.com/community_building/3879428.htm

    http://www.theregister.co.uk/2009/05/15/avsim_destroyed/
