Data Longevity: VMware deduplication change over time, NetApp ASIS deterioration, and the EMC Guarantee

Hey guys, the other day I was having a conversation with a friend of mine that went something like this.

How did this all start, you might ask?!? Well, contrary to popular belief, I am a STAUNCH NetApp FUD dispeller.  What that means is, if I hear something said about NetApp by a competitor, peer, partner, or customer which I feel is incorrect or just sounds interesting, I take it upon myself to prove or disprove it, because, well, frankly… people still hit me up with NetApp questions all the time :) (And I'd like to make sure I'm supplying them with the most accurate and reflective data! – yea, that's it, and it has nothing to do with how much of a geek I am.. :))

Well, in defense of the video, it didn't go EXACTLY like that.  Here is a little background on how we got to where that video is today :)  I recently overheard someone say the following:

What I hear over and over is that dedupe rates when using VMware deteriorate over time

And my first response was "nuh uh!" Well, maybe not my FIRST response… but it was quickly followed by "Let me try to get some foundational data," because you know me… I like to blog about things, and as a result I collect way too much data trying to validate and understand it, so that whatever I end up saying, I say accurately :)

The first thing I did was engage several former NetApp folks who are as agnostic and objective as I am to get their thoughts on the matter (we were on the same page!). Data collection time!

For data collection… I talked to some good friends of mine about how their dedupe savings have held up over time, since they were so excited when we first enabled it (and I was excited for them!).  This is where I learned some… frankly disturbing things. (Interestingly enough, I did talk to numerous guys named Mike, and on the whole, everyone I talked with shared data that reflected similar findings.)

Disturbing things learned!

Yea, I've heard all the jibber jabber before, usually touted as FUD, that NetApp systems will deteriorate over time in general (whether it be performance or space savings), etc., etc.

Well, some of the disturbing things learned, coming straight from the field on real systems protecting real production data, were:

  • Space savings are GREAT, and will be absolutely amazing in the beginning! 70-90% is common… in the beginning (call this the POC and burn-in period).
    • As that data starts to 'change' ever so slightly, as you would expect your data to change (not sit static and read-only), you'll see your savings start to decrease, by as much as 45% over a year.
    • This figure is not NetApp's fault.  Virtual machines (mainly what we're discussing here) are not designed to stay uniformly aligned to 4K blocks no matter what, so the very fact that they change is absolutely normal.  This loss isn't a catastrophe; it's a fact of the longevity of data.
  • Virtual machine data which is optimal for deduplication typically amounts to 1-5% of the total storage in the datacenter.  In fact, if we want to lie to ourselves, or we have a specific use case, we can pretend it's upwards of 10%, but not much more than that.  And this basically accounts for operating systems, disk images, blah blah blah – the normal type of data that you would dedupe in the first place.
    • I found that particularly disturbing because, going in to review the data from these numerous environments… I had the impression VMware data would account for much more!  Instead I saw a 50TB SAN with only ~2TB of data residing in datastores, and of that, only 23% was deduplicating (I was shocked!).
    • I was further shocked to find that, over the course of a year on a 60TB SAN, this customer only found 12TB of data they could justify running the dedupe process against, and of that they were seeing less than 3TB of 'duplicate data,' coming in at around 18% space savings over that 12TB.  The interesting bit is that the other 48TB of data just continued on, unaffected by dedupe.  (Yes, I asked why they don't try to dedupe it… they did in the lab and, well, it never made it into production.)

At this point, I was even more concerned.  Concerned that there was some truth to this whole story that NetApp starts out really high in the beginning (performance/IO way up there, certain datasets with amazing dedupe ratios to start) and then starts to drop off considerably over time, while the equivalent EMC system performs consistently the entire time.

Warning! Warning Will Robinson!

This is usually where klaxons and red lights go off in my head.  If what my good friends (and customers) are telling me is accurate, then not only will my performance degrade merely by using the system, but my space efficiency will deteriorate over time as well.  Sure, we'll get some deduplication, no doubt about that!  But the long-term benefit isn't any better than compression (as a friend of mine commented on this whole ordeal).  Of the many ways of trying to look at this and understand it, I discussed it with my friend Scott, who had the following analogy and example to cite:

The issue that I’ve seen is this:

Since a VMDK is a container file, the nature of the data is a little different from a standard file, like a Word doc, for example.

Normally, if you take a standard Windows C: drive – like on your laptop – every file is stored as 4K blocks.  However, unless the file is exactly divisible by 4K (which is rare), the last block has just a little bit of waste in it.  It doesn't matter if this is a Word doc, a PowerPoint, or a .dll in the \windows\system32 directory; they all have a little bit of waste at the end of that last block.

When converted to a VMDK file, the files are all smashed together, because inside the container file we don't have to keep that 4K boundary.  It's kind of like sliding a bunch of books together on a bookshelf, eliminating the wasted space.  Now, this is one of the cool things about VMware that makes the virtual disk more space-efficient than a physical disk – so this is a good thing.
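To make that 'books on a shelf' picture concrete, here's a quick back-of-the-napkin Python sketch (my own toy model of what Scott describes – the file sizes are invented purely for illustration, and this isn't anyone's actual on-disk format): it compares how much space a pile of files consumes when each one is rounded up to its own 4K boundary versus when they're packed end to end inside a container.

```python
import random

BLOCK = 4096  # 4K allocation unit, as in Scott's example

# Hypothetical file sizes in bytes -- stand-ins for docs, DLLs, etc.
random.seed(1)
file_sizes = [random.randint(1, 512 * 1024) for _ in range(10_000)]

def rounded_up(size: int, block: int = BLOCK) -> int:
    """Space consumed when the final partial block is padded out to a full 4K."""
    return ((size + block - 1) // block) * block

on_disk = sum(rounded_up(s) for s in file_sizes)   # each file keeps its own 4K tail waste
packed  = sum(file_sizes)                          # files slid together, no tail waste

print(f"rounded to 4K blocks:  {on_disk / 2**20:,.1f} MiB")
print(f"packed in a container: {packed / 2**20:,.1f} MiB")
print(f"slack eliminated:      {(on_disk - packed) / on_disk:.1%}")
```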

So, when you have a VMDK and you clone it – let's say you create 100 copies and then do a block-based dedupe – you'll get a 99% dedupe rate across those virtual disks.  That's great – initially.  NetApp tends to calculate this "savings" into their proposals and tell customers who require 10TB of storage that they can just buy 5TB, dedupe it, and then they'll have plenty of space.

What happens is that, after buying half the storage they really needed, the dedupe rate starts to break down.  Here's why:

When you start running the VMs and adding things like service packs or patches, for example – well, that process doesn't always add files to the end of the VMDK.  It often deletes files from the middle, beginning, or end, and then replaces them with other files, etc.  What happens then is that the bits shift a little to the left and the right, breaking the block boundaries.  Imagine adding and removing books of different sizes from the shelf while making sure there's no wasted space between them.

If you did a file-by-file scan of the virtual disk (say, a Windows C: drive), you might have exactly the same data within the VMDK; however, since the blocks no longer line up, the block-based dedupe, which is fixed at 4K, sees different data, and therefore the dedupe rate breaks down.

A sliding-window technology (like what Avamar does) would solve this problem, but today ASIS is fixed at 4K.
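To see why that boundary shift hurts a fixed-block engine so badly, here's a minimal Python sketch I threw together (purely illustrative – it's a generic fixed-4K hashing toy, not NetApp's ASIS or EMC's Avamar code): it compares a cloned chunk of data against its source block by block, then inserts a few bytes near the front of the clone, the way a patch might, and compares again.

```python
import hashlib
import os

BLOCK = 4096  # fixed 4K block size, as in Scott's description

def block_hashes(data: bytes) -> set:
    """Hash every fixed 4K block; matching hashes are dedupe candidates."""
    return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

# Pretend this is the interesting content of a guest OS image (illustrative only).
base = os.urandom(BLOCK * 256)   # ~1 MiB of "VMDK" content
clone = base                     # a freshly cloned copy

shared_before = block_hashes(base) & block_hashes(clone)
print(f"blocks shared right after cloning: {len(shared_before)} of {len(block_hashes(base))}")

# Simulate a patch that inserts a few bytes near the front of the clone,
# shifting everything after it off the original 4K boundaries.
patched = clone[:100] + b"hotfix" + clone[100:]

shared_after = block_hashes(base) & block_hashes(patched)
print(f"blocks shared after a 6-byte shift:  {len(shared_after)} of {len(block_hashes(base))}")
```

Run it and the shared-block count drops from every block to essentially zero, even though almost all of the underlying data is still identical; a sliding-window / content-defined chunking approach keeps finding the common data because its boundaries follow the content rather than fixed offsets.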

Thoughts?

If you have particular thoughts about what Scott shared there, feel free to comment and I'll make sure he reads them as well; but this raises some interesting questions.

We've covered numerous things here, and I've done everything I can to avoid discussing the guarantees I feel I've talked about to death (linked below), so, addressing what we've discussed:

    • I'm seeing, on average, 20% of a customer's data which merits deduping, and of that I'm seeing anywhere from 10-20% space saved across that 20%.
      • Translation: with 100TB of data, 20TB is worth deduping, reclaiming about 4TB of space in total; thus, on this conservative estimate, you'd get about 4-5% space saved! (There's a quick math sketch just after this list.)
      • Translation: when you have a 20TB data warehouse and you go to dedupe it (you won't), you'll see no space gained, with 100% of the cost across it.
        • With the EMC Unified Storage Guarantee, that same 20TB data warehouse is covered by the 20% more efficient guarantee (well, EVERY data type is covered, without caveat).   [It almost sounds like a shill, but it really bears repeating, because frankly this is earth-shattering and worth discussing with your TC or whoever.]
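If you want to sanity-check that translation yourself, here's the same arithmetic as a trivial Python sketch (the percentages are just the field averages I quoted above, not promises from any vendor):

```python
def effective_savings(total_tb: float, dedupe_worthy_fraction: float, savings_fraction: float) -> float:
    """Fraction of the whole array reclaimed when only part of it is worth deduping."""
    dedupe_worthy_tb = total_tb * dedupe_worthy_fraction
    reclaimed_tb = dedupe_worthy_tb * savings_fraction
    return reclaimed_tb / total_tb

# 100TB of data, 20% worth deduping, ~20% saved on that slice -> ~4TB, or ~4% overall
print(f"{effective_savings(100, 0.20, 0.20):.0%} of the whole array reclaimed")

# The 60TB SAN from earlier: 12TB worth deduping at ~18% savings -> ~2.2TB, or ~3.6% overall
print(f"{effective_savings(60, 12 / 60, 0.18):.1%} of the whole array reclaimed")
```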

For more great information on EMC’s 20% Unified Storage Guarantee – check out these links (and other articles I’ve written on the subject as well!)

EMC Unified Storage is 20% more efficient Guaranteed

I won't subject you to it, especially because it is over 7 minutes long, but here is a semi-funny (my family does NOT find it funny!) video about EMC's Unified Storage Guarantee, making a comparison to NetApp's guarantee.  Various comments are included in the description of the video – don't worry if you never watch it… I won't hold it against you ;)

Be safe out there, the data jungle is a vicious one!   If you need any help driving truth out of your EMC or NetApp folks feel free to reach out and I’ll do what I can :)

SPOILERS!!!

 

You didn't think I'd leave it that easily!  I definitely encourage conversation and engagement, and I absolutely want you to; some of you are going to read what I said (or completely disregard or gloss over it) and say "I GET AMAZING DEDUPE RESULTS TODAY, LOOK AT WHAT I HAVE!" or "MY ENVIRONMENT IS DEDUPING X, Y AND Z AT J PERCENT," etc.  No, I get it, I do.  You had BETTER be seeing massive deduplication space savings in your VMware environment.  In fact, I hope you are seeing 10-20% savings in your SQL environments, which you're not compressing by default (why are you deduping SQL? Yea, I know, we're crazy, let's get past that!), and on your file data you'll be seeing what… 35-45% deduped per volume you have allocated or broken down into your CIFS or NFS structures?

I know, for the most part, exactly what type of space savings you SHOULD see out of the gate, and at a fairly decent level what you would typically see over the span of x months or years (rate of change over the period, etc.).  Not to mention that I personally love the algorithm employed; I think it's really cool functionally, and I understand it far more than most people should.  But it is not the panacea of all things.  And with that in mind, I want you to be combative if you are defensive about the savings you've gained, ESPECIALLY if you are questioning what that may bode for the future.  I hate nothing more than making a promise – "Look! You've saved 80% today, so it can only stay at 80% or above in the future, right?" – only to see it shattered, because there honestly wasn't enough historical data, let alone usage patterns, to dictate what the future would look like.

As I always have, I encourage you to question, confirm, validate, and question again.  I drive a Prius, and NetApp is a little like a Prius in a number of ways.  In optimal conditions you can get an expected MPG, and pretty much assume you'll get it.  Unless you go uphill, or downhill, or it's too hot out, or it's too cold out; otherwise you can pretty much expect to get that same MPG! (Even if my MPG deviates by as much as 30% under any number of conditions.)  EMC has always been conservative.  If they say you'll get 'x', you have the expectation, nay, a guarantee, that you will be getting 'x'.  That's something they've always been very good at, which is why EMC is "where information lives." :) (Bad tagline reference… ;))