Enterprise Strategy Group | Getting to the bigger truth.™

It’s Not About Reduction Ratios: The Real Impact of Global Deduplication

As deduplication takes hold and IT organizations gain maturity in its use, it’s only natural that users will seek greater optimization of the technology.  One would think that global deduplication would be on that optimization checklist.  Curiously, recent ESG data protection research found that the ability to deduplicate across systems, as opposed to just within a system, ranked near the bottom of respondents’ evaluation and selection criteria—well below their more pressing concerns such as cost, ease of implementation/use, impact on backup/recovery performance, integration with existing backup processes, and scalability.

What is global deduplication?  Deduplication in backup finds redundant data and ensures that only unique data is written, so data is stored more efficiently.  Duplicates can be identified in two ways: within a single system (backup data passing through an individual system is compared only with other data passing through that same system) or across systems (backup data passing through an individual system is compared with data passing through that system as well as the other systems in the domain).  The latter, global deduplication, can result in higher deduplication ratios since there are more comparisons and, therefore, more chances to find duplicate data.
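
As a rough illustration of the difference, the following Python sketch counts how many chunks would be stored when each system keeps its own index versus when all systems share one. The fixed 4-byte chunks, the SHA-256 fingerprints, the function names, and the two sample streams are assumptions made for this example; they are not taken from any of the products discussed here.

```python
import hashlib


def chunk_hashes(data, chunk_size=4):
    """Split data into fixed-size chunks and fingerprint each one.

    Fixed-size chunking and SHA-256 are illustrative choices only.
    """
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def stored_chunks_local(streams, chunk_size=4):
    """Each system has its own index, so duplicates are found only within a system."""
    return sum(len(set(chunk_hashes(d, chunk_size))) for d in streams.values())


def stored_chunks_global(streams, chunk_size=4):
    """All systems share one index, so duplicates are also found across systems."""
    index = set()
    for d in streams.values():
        index.update(chunk_hashes(d, chunk_size))
    return len(index)


# Hypothetical backup streams from two systems that share some content.
streams = {
    "system_a": b"AAAABBBBCCCCAAAA",
    "system_b": b"BBBBCCCCDDDDDDDD",
}

total = sum(len(chunk_hashes(d)) for d in streams.values())
local = stored_chunks_local(streams)
global_ = stored_chunks_global(streams)

print(f"chunks ingested:       {total}")
print(f"stored, local-only:    {local}  ({total / local:.2f}:1)")
print(f"stored, global index:  {global_}  ({total / global_:.2f}:1)")
```

With these sample streams, the per-system indexes store 6 of the 8 chunks while the shared index stores 4, because the BBBB and CCCC chunks appear on both systems.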

Global deduplication is often the byproduct of the underlying architecture, or approach, that the solution uses to perform redundancy checks, such as source-side software or a grid or clustered-node architecture.  Backup software solutions that offer deduplication at the global level include CommVault Simpana, EMC Avamar, and Symantec NetBackup PureDisk.  Backup target devices with global deduplication include ExaGrid EX Series, FalconStor FDS, HP VLS, IBM ProtecTIER, NEC HydraStor, and Sepaton DeltaStor.

Are the potentially higher reduction ratios afforded by global deduplication the real benefit?  They are a benefit, but probably not the biggest one.  Poor reduction ratios are not the real issue with solutions that lack global deduplication.  The real issue is the inefficiency of managing the silos that develop in local-only deduplication solutions.

Data growth rates in the 10-30% range are the norm.  Deduplication helps stem the need to scale the backup environment.  However, at some point, backup throughput and/or capacity requirements are going to “break” the solution. Adding more backup infrastructure can get you back on track, but it introduces a new problem: more points of management.

What the global deduplication approaches have in common is a multi-node architecture with the ability to manage multiple deduplication systems as one.  Throughput scalability, high availability, and load balancing benefits of these architectural approaches should be the callout.  It’s these features that reduce administrative overhead—oftentimes the larger burden in the backup TCO equation—and, importantly, tie back into the aforementioned top criteria for purchasing deduplication, including cost, ease of use, performance, and scalability.


All views and opinions expressed in ESG blog posts are intended to be those of the post's author and do not necessarily reflect the views of Enterprise Strategy Group, Inc., or its clients. ESG bloggers do not and will not engage in any form of paid-for blogging.

3 Responses to “It’s Not About Reduction Ratios: The Real Impact of Global Deduplication”

  1. Great post – I like the insight that global deduplication, while leading to better data reduction ratios, is equally or more of a total cost of operations play. For primary storage, which is where a lot of the deduplication-dedicated device vendors seem to want to go, this is badly needed.

    One question that might be asked: if federation is to occur in the backup stream, what are the pros and cons of it occurring upstream, at the metadata processing stage in the ingest function, versus downstream, in the actual storage of the data itself? By the nature of a deduplication storage vendor, it’s of course going to occur at storage. If you vertically integrate, it’s possible to move that federation higher in the ingest stream so as to increase both parallelism and (less often) redundancy.

    It’s a fascinating area with new developments – and of course the continued improvement in disk density, 6Gb commodity interconnects, falling price per gigabyte, and the like add a twist as well.

  2. Fabrice Helliker says:

    Yes, I wholly agree the benefit of global de-duplication is overstated. Global de-duplication appeals to everyone’s ideal view of the world, where unwanted redundancy is removed. The truth, as you stated, is that unless it is a by-product, it’s a waste of effort. What global de-duplication provides is the ability to remove instances of data that appear on multiple systems. That sounds like a big deal, but it isn’t where de-duplication really pays off. De-duplication savings occur when you remove the duplicated instances of backups. De-duplicating 10 full backups will give you almost 10x data reduction. De-duplicating a few common blocks between systems is a drop in the ocean compared to this.

    Don’t get me wrong: as a vendor that offers the ability to de-duplicate data that appears across multiple systems, I won’t bury the feature. But I am pleased that this has been brought up, and it should be put in perspective relative to where the actual savings occur.

    What would be nice now is to move on from the global dedupe, in-line/off-line dedupe, sliding/fixed-block dedupe, etc. discussion and address the question of why we are duplicating data in the first place. Backups duplicate data and devices de-duplicate it. Data de-duplication is a band-aid. The de-duplication process is expensive, yet the same data efficiency savings can be had without the overhead by preventing duplication in the first place.

    Fab
    http://www.cofio.com

  3. John Chaves says:

    There is general confusion over the whole de-duplication ratio discussion, i.e., 5x, 10x, 20x, … I have seen some marketing from big firms with three letters claiming 50x or 60x reduction. This is all mathematically misleading. If you have any data set and you deduplicate it, diminishing returns come into play. Simply put, at a 2:1 ratio you have halved the size of the data set, at 5:1 you have reduced it by 80%, and at 10:1 you have reduced it by 90%. The effort to reduce the remaining 10% of data yields smaller reductions. If you want to get to 20:1 deduplication on a data set, 95% of your data is de-duplicated, but that is only 5 percentage points more than the 90% reduction at 10:1.
    The misleading marketing comes from the subsequent data sets, i.e., the second, third, and fourth backups and, depending on the retention of data, the next 30 to 60 days of backups. The only deduplication occurring is on new data. The blocks that have already been ingested and deduplicated are discarded and given pointers. The epiphany here is that those new de-duplicated blocks that have been seen before and given pointers to identical blocks are now just SINGLE INSTANCE STORES, SIS. The deduplication ratio gives way to regular old SIS in trying to reach 100% deduplication, and thus block-level deduplication ratios greater than 20x should not be considered mathematically valid. The ability to store and manage more SIS pointers to the deduplicated data store is thus what the marketing materials citing 50-60x or greater reductions are referring to. I hope this clears up any ambiguity about the ratios.
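
The arithmetic behind the ratios discussed in these comments can be checked with a short Python sketch. It assumes a deduplication ratio of N:1 means the stored data is 1/N of the original, and the 1% change rate between full backups is a hypothetical figure chosen only for illustration; the function names are likewise invented for this example.

```python
def percent_reduction(ratio):
    """Percentage of the original data eliminated at a given N:1 ratio."""
    if ratio < 1:
        raise ValueError("a deduplication ratio cannot be below 1:1")
    return (1 - 1 / ratio) * 100


def repeated_fulls_ratio(fulls, change_rate):
    """Ratio achieved on `fulls` full backups of one system.

    Sizes are in units of one full backup. The first full is stored whole;
    each later full adds only the fraction `change_rate` that is new.
    The change rate is an illustrative assumption, not a measured value.
    """
    logical = fulls
    stored = 1 + (fulls - 1) * change_rate
    return logical / stored


for ratio in (2, 5, 10, 20, 50):
    print(f"{ratio}:1 removes {percent_reduction(ratio):.0f}% of the data")

print(f"10 fulls at 1% change: {repeated_fulls_ratio(10, 0.01):.1f}:1")
```

Under these assumptions, going from 10:1 to 20:1 moves the reduction from 90% to 95%, and ten full backups with little change between them land just under 10:1, which is consistent with the figures quoted in the comments.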
