Boy are my arms tired!

Well, I’m back in the States, and working on getting back into the groove. I want to take a second and reiterate my thanks to the many guest bloggers who helped me out and contributed some really great information while I was gone. I really enjoyed learning from each one of the entries, and I can tell from your comments that they were well received. So thank you again, Ian, Bob, Nick, Ryan, Jeff, Phil, and Michael. You guys helped me out a lot, and I really appreciate it.

After Amy and I got back to Columbus, things got even crazier. Due to an (un?)fortunate alignment in the cosmos, we were scheduled to have the moving company come pack up all of our belongings the morning after we got home from the trip at 11:30pm. So I slept about three hours that night, while Amy pulled an all-nighter to get things ready. I don’t know how she did it. We got it done, though, and spent Saturday night with our friends Mike and Heather, since we didn’t have a bed at that point.

Sunday morning we got up, packed the things we needed into the car, and left for New Jersey, where we are now. Our furniture should arrive sometime this morning.

I started in the corporate office yesterday, and it seems like it’s going to be a lot of fun. My “todo” list is sitting at around a dozen items, and that’s only because I left it at work; I’ve come up with several more tasks since then. Since there’s no shortage of things for me to do, I’d better get started.

Thanks for reading, and I hope to return to normal blog posts tomorrow. By the way, I’ll be uploading photos of the trip to my Flickr page, if you’re interested.


Backups Suck

Many thanks to Michael Janke for this blog entry

Years ago we went through a period when we had nothing but problems with backups. Tape drives failed, changers failed, jobs failed. We rarely, if ever, went a weekend without getting called in to tweak, repair, or restart backup hardware or software. It was a pain. During one of the many visits the hardware vendor made to replace drives or changer parts, I asked the tech:

“Does everyone hate backups as much as I do?”

The answer:


So backups suck, but like it or not they are an essential part of system administration, perhaps the most essential. If you can’t recover from failure, then you are not a system manager or system administrator. Change your title. The one you’ve been using isn’t appropriate.

Here are my thoughts on backups.
Key concepts:

RPO and RTO. If you don’t know what they mean, start googling. If you do know what they mean, then you should also know what they are for each of your applications. If you have no formal SLAs covering recovery, you should at least have informal agreements between you, your managers, and your customers about the expectations for recovery points and recovery times for storage, server, and site failures. If you don’t know the expected RPO and RTO for your applications, you’ve got a problem: you can’t really make backup and recovery decisions without at least some idea of what they might be. At the very least, make up an RPO and RTO and let your boss know what they are. A little CYA doesn’t hurt.
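
If it helps to make that concrete, here’s a toy Python sketch of the kind of informal record I’m talking about. The application names and targets are made up, and in real life the backup age would come from whatever your backup software reports:

```python
from datetime import timedelta

# Hypothetical applications and targets -- substitute your own.
targets = {
    "payroll-db":  {"rpo": timedelta(hours=1),  "rto": timedelta(hours=4)},
    "file-server": {"rpo": timedelta(hours=24), "rto": timedelta(hours=8)},
}

def check_rpo(app: str, last_backup_age: timedelta) -> None:
    """Flag any application whose newest backup is older than its RPO."""
    rpo = targets[app]["rpo"]
    if last_backup_age > rpo:
        print(f"{app}: newest backup is {last_backup_age} old, RPO is {rpo} -- exposed")
    else:
        print(f"{app}: within RPO ({last_backup_age} <= {rpo})")

check_rpo("payroll-db", timedelta(hours=3))   # violates the 1-hour RPO
check_rpo("file-server", timedelta(hours=6))  # fine
```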

Backup versus Archive. You can Google that phrase (‘backup versus archive’) and get good definitions. The way I define them, a backup exists to permit recovery from system failure to the most recent recoverable point in time, in a manner that meets recovery point and recovery time objectives. An archive exists to permit recovery to points in time other than the most recent recoverable point in time. By those definitions, any backup older than the most recent backup is an archive. In general, backups protect you from physical failures. Archives protect you from logical failures.
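
A toy illustration of that definition: given a set of restorable copies, only the newest one is the backup, and everything older has become an archive:

```python
# Nightly fulls, oldest to newest (dates are made up).
copies = ["2009-03-27", "2009-03-28", "2009-03-29"]

backup = max(copies)                            # protects against physical failure
archives = [c for c in copies if c != backup]   # protect against logical failure

print("backup: ", backup)     # 2009-03-29
print("archives:", archives)  # ['2009-03-27', '2009-03-28']
```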

You have a valid backup when…

1. The backup is on separate spindles and controllers from the source data.
2. The backup is off site.
3. The backup is tested by successfully restoring data.

If it isn’t on separate controllers and spindles then it’s not a backup. It might be a copy of the data that protects you against certain failure modes, but it’s not a backup. RAID 1, 5, 6, 10, 0+1, 1+0, or whatever are not substitutes for backups. Controllers fail. I’ve personally experienced a handful of controller failures that resulted in scrambled data. The failed controller will scramble both halves of the mirror and all of the RAID set. So a database dump to a LUN on the same SAN as the database isn’t a backup until it is swept off to tape or copied to a disk pool on some other controller/spindles.
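
To make that “swept off” step concrete, here’s a rough sketch assuming a PostgreSQL database; the paths and hostname are placeholders for your own staging LUN and a host on entirely separate storage:

```python
import subprocess
from datetime import date

# Placeholder names -- use your own database, staging path, and target host.
dump_file = f"/backup-staging/appdb-{date.today()}.dump"

# Step 1: dump to a local LUN. Fast, but if that LUN sits on the same SAN
# as the database, this is still just a copy, not a backup.
subprocess.run(["pg_dump", "--format=custom", "--file", dump_file, "appdb"],
               check=True)

# Step 2: sweep it off to a host on separate controllers and spindles.
# Only now does it start to count as a backup (off site comes next).
subprocess.run(["scp", dump_file, "backuphost.example.com:/backups/"],
               check=True)
```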

If it isn’t off site, it’s not a backup. If you have a stack of tapes that will get sent off to the data vault when the company that does that shows up at 10am Monday, those tapes will be a valid backup at 10:05am Monday. Until then, they are a copy of your data, not a valid backup.

If you haven’t tested it, it’s not a backup. Think about that. One of the things I’ve done to drive home the importance of backups is to walk up to a sysadmin’s cube and ask them to delete their home directory. I’m the boss, I can do that. Trust me, it’s fun. :-) If they hesitate, I know right away that they don’t have confidence in their backups. That’s bad – for them, for me, and for our customers.

A backup is not an archive, and an archive is not a backup. In my world, an archive permits recovery to points in time other than the most recent recoverable point in time. Perhaps, because of a regulatory requirement, you need to be able to recover files or databases as they were a month, a year, or a decade ago. Then you need archives. Even if you don’t have regulatory or other retention requirements, an archive still protects you against ‘logical failures’: file deletion or corruption that went undetected for a period of time, or accidental or intentional destruction of data.

But you likely can design a system where backups and archives use the same hardware and software, and in many cases a backup can become an archive. In the common Grandfather-Father-Son (GFS) tape rotation, a full backup becomes an archive as soon as the next full backup is finished. At that point in time, the full backup is no longer protecting you against server, storage, or site failure. It’s too old for that. But it is still protecting you against logical failure (a file or database that got corrupted or deleted, but went undetected for a period of time).
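
Here’s a rough sketch of how a GFS scheme might classify each full backup; the retention periods are illustrative, not prescriptive:

```python
from datetime import date, timedelta

def gfs_role(backup_date: date) -> str:
    """Classify a full backup under a simple Grandfather-Father-Son scheme.

    Illustrative rules: a month-end full is a grandfather (kept longest),
    any other Sunday full is a father, and everything else is a son.
    """
    if (backup_date + timedelta(days=1)).month != backup_date.month:
        return "grandfather (keep ~1 year)"
    if backup_date.weekday() == 6:  # Sunday
        return "father (keep ~1 month)"
    return "son (keep ~1 week)"

for d in [date(2009, 3, 29), date(2009, 3, 30), date(2009, 3, 31)]:
    print(d, "->", gfs_role(d))
```
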
Snapshots, Replication and Log Shipping.

Vendors are more than happy to sell us tools and toys that solve all our problems, but do they really? When do snapshots and various replication strategies protect us against physical and logical failures?

It depends.

We replicate some data (actually 25 million files) to an off-site location using an OEM’d version of Doubletake. The target of the replication is a fully configured cluster on separate SAN controllers, miles away from the source. That copy of the data protects us against site, storage, and server failure (it’s our backup). But when a customer hits the ‘press here to delete a half-million files’ button that the software vendor so graciously provided (logical failure), the deletes get replicated in a couple of seconds. The off-site replica doesn’t help. Those files are recovered from an archive (last night’s incremental + last weekend’s full), not from the backup (the real-time replica).

Another example is the classic case where a user or DBA deletes a large number of rows from a table or does the old ‘DROP TABLE’ trick. If you’ve configured log shipping or some other database replication tool to protect yourself against site, server, or storage failure, you’ll replicate the logical failure (the deletion or drop) faster than you can blink, and your replicas will also be toasted. The replication technology will replicate the good, the bad, and the ugly. It doesn’t know the difference. You need to be able to perform a point-in-time recovery to a point before the damage was done, and replication alone doesn’t provide that. Transaction logs, archive logs, and similar technologies provide the point-in-time recovery.
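
A toy illustration of the difference (no real database involved): a replica replays the whole log, DROP and all, while point-in-time recovery replays it only up to a moment you choose:

```python
# A toy transaction log: (timestamp, statement). A real database's WAL or
# redo/archive logs play the same role.
log = [
    (100, "INSERT row 1"),
    (200, "INSERT row 2"),
    (300, "DROP TABLE orders"),  # the logical failure
    (400, "INSERT row 3"),
]

def replicate(log):
    """A replica applies everything -- the good, the bad, and the ugly."""
    return [stmt for _, stmt in log]

def point_in_time_recover(log, target_time):
    """Replay the log only up to a point *before* the damage was done."""
    return [stmt for ts, stmt in log if ts <= target_time]

print(replicate(log))                   # the DROP gets replicated too
print(point_in_time_recover(log, 250))  # stops just short of the DROP
```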

Snapshots tend to complement replication. In general, a snapshot of a disk that is stored on the same controllers protects you against logical failure (it’s an archive), but not against site, server or storage failure (it’s not a backup). The snap gives you a point in time that is recoverable against logical failure, but not physical failure.

Whatever you have for a backup and archive system, keep in mind:

* physical and logical failure
* recovery point and recovery time

And make sure you understand how you will recover from each failure mode, within the recovery time, to the recovery point.

Then, because I teach at a local college, I get to give you all an assignment. It’s got two parts:

1. Delete your home directory
2. Recover it from backup

Let me know how you did.
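
And if deleting your real home directory feels a little bold, here’s a tamer version of the same test, sketched in Python: restore to a scratch directory and compare checksums. The restore step itself is a stand-in for whatever your backup tool provides:

```python
import hashlib
import pathlib

def tree_hashes(root: pathlib.Path) -> dict:
    """Map each file's relative path to a SHA-256 digest of its contents."""
    return {
        p.relative_to(root): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*") if p.is_file()
    }

def verify_restore(live_dir: str, restored_dir: str) -> None:
    """Compare a live tree against a copy restored to a scratch location."""
    live = tree_hashes(pathlib.Path(live_dir))
    restored = tree_hashes(pathlib.Path(restored_dir))
    missing = set(live) - set(restored)
    changed = {p for p in set(live) & set(restored) if live[p] != restored[p]}
    if missing or changed:
        print(f"restore FAILED: {len(missing)} missing, {len(changed)} differ")
    else:
        print("restore verified -- now go do it for real")

# restore_from_backup("/home/you", "/tmp/restore-test")  # your tool's restore step
# verify_restore("/home/you", "/tmp/restore-test")
```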

Michael Janke, Last In, First Out

Software patching is the other benefit of virtualization

Many thanks to Philip Sellers for this blog entry!

Our sys admin group seems to be constantly grappling with software patches. We feel constantly behind and reactive to newly released patches and firmware, and it’s a never-ending cycle. Since I joined the company a little over two and a half years ago, I’ve been asked to write a patch plan for our Windows servers twice, maybe three times. Unfortunately, we have never been able to make these patches happen consistently. We’ll make a big push to patch, and that seems to break lots of things, which forces us to stop again and fall further behind. Lately, we haven’t felt we have a choice but to apply the patches for some of the recent security holes that Microsoft has plugged. So we’re faced with what seems like a catch-22, and we’ve pulled the trigger and bitten the bullet.

Our last patch push was handled very well, with virtually no problems arising from the patches applied, except in our Citrix farm. Citrix Presentation Server didn’t like one or two of the patches – I can’t really tell you what happened there, but I know we reverted to pre-patch disks because of problems. The good news is that we are finally (mostly) up to date. The bad news is that the last push was handled almost 100% manually, which anyone in our field will tell you is NOT the way to patch. It’s too time-consuming, monotonous, and wasteful.

What is different today, and what has allowed us to realistically look at automated patching, is our virtualization using VMware. Since the last patch plan I drew up, we’ve virtualized much of our datacenter, and most of our newer sprawl has been contained in virtual servers. Our datacenter today is about 80% virtual to 20% physical for Windows servers. We began investigating VMware’s Update Manager product several months ago, and we’ve been really impressed with the results.

Every good patch plan has a few basics that have to be included, in my opinion. First, you have to know what patches need to be applied, so you need to connect to a patch repository. There are third-party software solutions that do a great job of this for a broad group of software products; VMware’s Update Manager uses Shavlik to provide much of its update database. Second, the plan should include fail-back and recovery. There are times when patches just don’t provide the expected results, and being able to revert is always critical. Third, you should be able to control the time updates are applied and minimize the amount of sys admin interaction required. For Windows, that can be accomplished via group policy and Active Directory structure, or by using third-party software like Update Manager. Fourth, you have to make room for exceptions. Every network has these, whether it’s the mission-critical server that can’t afford downtime or the self-important system with uptime dictated for political reasons.

Now that we have caught up on patches, we’ve drafted a new patch plan in hopes of never getting that far behind again. We settled on Microsoft’s WSUS (Windows Server Update Services) and VMware’s Update Manager as our two-pronged solution. These two products hit our two major categories of Windows servers: WSUS for physical servers and Update Manager for virtual servers. Both products allow for the approval of updates and for reporting against the baseline of approved updates to see which systems require patching. From there, you can begin the remediation process to bring those systems up to the baseline.

Update Manager also brings the inherent benefits of virtualization to the table where patching is concerned. The Update Manager workflow and scheduler include taking rollback snapshots, and automatically removing them, as part of the patching workflow. This is a big capability, as we all know that sometimes patches cause problems or even fail to install. The scheduling features are robust and allow for a fully customized rollout schedule while the administrator just sits back and watches the rollout occur. And, as with any automation, there is a small risk of missing something during the install, but so far our experience is that the software reports back any problems so that you can give them attention individually. It’s also a great solution for our DMZ, since the updates are mounted as virtual CDs and installed from there. That addresses the problem of patches filling up a server when Automatic Updates downloads and executes them but never cleans up. All in all, we feel like we’ve found a winner.
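
For the curious, the shape of that snapshot-wrapped workflow looks something like the sketch below. The function names are hypothetical stand-ins, not Update Manager’s API; the point is the take-snapshot, patch, verify, revert-or-remove sequence:

```python
# Hypothetical stand-ins for the platform calls; Update Manager drives the
# equivalent steps for you against real VMs.
def take_snapshot(vm):         print(f"{vm}: snapshot 'pre-patch' taken"); return "pre-patch"
def apply_patches(vm):         print(f"{vm}: approved patches applied")
def checks_pass(vm):           return True  # post-patch health checks
def revert(vm, snap):          print(f"{vm}: reverted to snapshot {snap!r}")
def delete_snapshot(vm, snap): print(f"{vm}: snapshot {snap!r} removed")

def patch_with_rollback(vm):
    snap = take_snapshot(vm)          # the safety net
    try:
        apply_patches(vm)
        if not checks_pass(vm):
            raise RuntimeError("post-patch checks failed")
    except Exception as err:
        print(f"{vm}: {err} -- rolling back")
        revert(vm, snap)
    else:
        delete_snapshot(vm, snap)     # the automatic removal step

patch_with_rollback("dmz-web01")
```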