Overselling services in the datacenter

There’s a great post on Storagebod today called “Living on a prayer”. The topic of discussion is overselling storage space via thin provisioning and data deduplication.

I commented on the article, but after thinking about it, I suspected some of my readers here would be interested as well. As I mentioned there, overallocation of resources is nothing new, and it’s not so much a matter of subterfuge as of statistics. It’s the same math that decides whether a product gets recalled:

amount of payout × number of payees vs. amount of recall

If the number on the left is bigger, then the company does the recall.
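Plugged into toy numbers (all hypothetical, purely for illustration), the decision rule looks like this:

```python
# All figures are made up, just to illustrate the recall math above.
payout_per_claim = 50_000       # average settlement per affected customer
number_of_payees = 300          # projected number of claims
recall_cost = 12_000_000        # cost of recalling every unit

expected_payouts = payout_per_claim * number_of_payees  # 15,000,000

# If the number on the left is bigger, the company does the recall.
decision = "recall" if expected_payouts > recall_cost else "pay the claims"
print(decision)  # → recall
```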

Even though it’s “industry practice” in nearly every industry, that doesn’t make it “right”. IT is different in a lot of ways, and at certain times there are very important things relying on the systems we admin. My friend Tom, who used to admin at a hospital, knows exactly what I mean.

Storagebod’s article does highlight some of the dangers, not necessarily in enterprise storage itself, but in the way we use it. Here’s the text of my comment there.

“Just a Storage Guy” (another commenter) was right when he said that oversubscription happens all over, not just in the datacenter. How many times have we seen service providers oversubscribe network links, airlines oversell flights, and Ticketmaster oversell venues?

I think it has become less a problem of planning and more a problem of statistics. In the analog world, are the benefits of oversubscription such that it is financially in our best interest to continue these practices, or will the backlash from consumers be too great?

In the digital world, what is the statistical likelihood of the “perfect storm” happening, and most importantly, how does that statistic interrelate with our guaranteed uptime requirements? Of course, IT is the only field I know of where one-in-a-million occurrences happen every day.

In the end, it all boils down to this: “Risk Management” is not “Risk Elimination”.
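To make that “perfect storm” statistic concrete, here’s a minimal sketch (my own toy model, not anything from Storagebod’s post): if each of N subscribers independently peaks with probability p, the chance that demand blows past provisioned capacity is a binomial tail, and that number is what has to be reconciled with the uptime guarantee.

```python
from math import comb

def p_overload(subscribers, p_active, capacity):
    """Probability that more than `capacity` subscribers peak at once,
    assuming each is independently active with probability `p_active`."""
    return sum(comb(subscribers, k) * p_active**k * (1 - p_active)**(subscribers - k)
               for k in range(capacity + 1, subscribers + 1))

# Hypothetical sizing: 100 subscribers, each peaking 10% of the time,
# with capacity provisioned for only 20 of them at once.
risk = p_overload(100, 0.10, 20)
print(f"chance of exceeding capacity: {risk:.6f}")
# Weigh that against the SLA: a 99.9% uptime guarantee tolerates roughly
# a 0.001 chance of being over the wire at any given moment.
```

The independence assumption is exactly what fails in a “perfect storm”, which is why the real risk is usually higher than this model suggests.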

Please share your thoughts (and practices, if possible). I’m interested to see where everyone stands.

  • bluehavana

    It will be interesting to watch how overselling plays out with cloud services. How “expandable” will cloud computing be when Michael Jackson dies again… well… you get my point. If all the web apps (or any sort of cloud-based service) in a provider’s cloud get hit at once, exactly how expandable is cloud computing?

  • sysadmin1138

    Where I get tetchy is when the usage algorithms don’t allow enough ‘slop’ space to deal with unexpected demand rates. Some systems are more tolerant of that than others, and that’s part of the risk analysis that needs to happen when these systems are built. Run too close to the wire and you can support higher loads on less hardware, only to go catastrophically down when usage unexpectedly spikes past your max capacity.

    Over-subscription is most efficient when the operator has a perfect demand model for their environment. Of course, there is no such thing as a perfect demand model, so the system has to carry unprofitable overhead to handle unanticipated excess demand. The amount of such overhead should be set by another risk-assessment model covering the likelihood of excess-usage events. Taken as a whole, you have a fairly efficient system that’ll hit the wall only once in a great while.

    A good IT example of oversubscription is a file server with an HSM system in place. The HSM system ensures that only files that have been accessed ‘recently’, for values of recent determined by filesystem analysis, are on the fast, expensive, frequently backed up storage. The stuff that hasn’t been accessed lives on slow, cheap, infrequently backed up storage. This is cost-efficient, since the ‘fast stuff’ could be costing $25/GB and the ‘cheap stuff’ $4.95/GB. The amount of fast storage you need scales with the active data-set on your servers. Works great!

    Until Google Desktop came around. Suddenly, desktops were actively indexing things like user home directories and shared volumes. Each indexing pass counted as an ‘access’. What’s more, if lots of users were indexing the same shared directory, enough ‘accesses’ could occur to cause the HSM to demigrate that 8-year-old budgeting spreadsheet from the cheap stuff to the fast stuff. And suddenly, the growth models for the fast stuff are blown out of the water. Users start getting ‘unable to save file’ errors, even though the system shows lots of free space out there.

    This is a good example of an "unknown unknown" biting IT in the butt. The advent of ubiquitous file indexing on desktops has forced HSM vendors to change how they allow storage to be managed.

  • In the analog world, we have examples of Bernie Madoff oversubscribing investors.
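The HSM example in sysadmin1138’s comment works out neatly in numbers. The tier prices ($25/GB fast, $4.95/GB cheap) come from the comment; the server size and active-set split below are my own hypotheticals.

```python
# Tier prices are from the comment above; the 10 TB server and 1.5 TB
# active set are hypothetical numbers for illustration.
FAST_PER_GB = 25.00     # fast, frequently backed up tier
CHEAP_PER_GB = 4.95     # slow, infrequently backed up tier

total_gb = 10_000       # a 10 TB file server
active_gb = 1_500       # the "recently accessed" working set

all_fast = total_gb * FAST_PER_GB
tiered = active_gb * FAST_PER_GB + (total_gb - active_gb) * CHEAP_PER_GB

print(f"everything on the fast tier: ${all_fast:,.0f}")  # → $250,000
print(f"HSM-tiered:                  ${tiered:,.0f}")    # → $79,575
```

Until, as the comment notes, something like desktop indexing quietly redefines “recently accessed” and the working-set assumption stops holding.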