Will Your Data Benefit from Deduplication? Find Out By Testing Dedupe Rates with the EMC CAT Tool

Is deduplication really all it’s cracked up to be?

With everyone in the industry talking about deduplication, you can’t go two minutes without hearing how great it is, along with some outlandish claims about deduplication rates. So the question is: is dedupe really all it’s cracked up to be? The answer isn’t really in the deduplication technology itself. It’s in the makeup of the data you’re looking to deduplicate. So how do you know if dedupe is the right technology for you?

Do you have a ton of highly-compressed images or multimedia files? These aren’t the ideal data types for deduplication.

Does your environment contain a lot of large databases like SQL, Exchange, and Oracle? If that is the case, dedupe can help, but not as much as those crazy marketing numbers say.

Do you have large file servers? Lots of VMware? Remote offices you need to back up? This is where dedupe really shines and can add real efficiencies to your environment. It is also where numbers like 200:1 or 500:1 come from, and they can actually be beaten in some cases.
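To put ratios like these in perspective, here is a small sketch that converts an N:1 dedupe ratio into the fraction of storage saved. The function name `space_savings` is just an illustrative helper, not part of any EMC tool.

```python
def space_savings(ratio):
    """Fraction of storage saved for a dedupe ratio of `ratio`:1."""
    return 1.0 - 1.0 / ratio


# Compare a modest ratio with the headline numbers quoted above.
for r in (10, 200, 500):
    print(f"{r}:1 saves {space_savings(r):.1%} of the original space")
```

Most of the savings arrive early: the jump from 200:1 to 500:1 changes the space saved by only a fraction of a percentage point.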

Now of course your data doesn’t nicely fit into just one of those categories above. It likely spans two of them, if not all three. So we’re back to the original question…is dedupe the right technology for you? One of the keys to deploying an *effective* deduplication solution is to know where to deploy it, how to deploy it, and what to expect.

Just because you can deduplicate your live VMware or database environment doesn’t necessarily mean you should. Deduplicating frequently accessed data has a lot of implications, and performance can suffer severely in some cases. While dedupe is a great technology, it can bring your environment to its knees if implemented incorrectly. I’ll address this in another post because that is a whole different topic. Today I want to focus on how to figure out if, and how, deduplication can benefit your business in the backup process.

Knowing where to deploy deduplication and what rates to expect can really only be determined through an assessment of your current environment. EMC has a great tool called the Commonality Assessment Tool (CAT) that will allow you to look at a subset of your data and see exactly what the commonality is. This tool can be downloaded from the IDS website for free here (click the link for the EMC CAT tool download).

So what is this tool and what can you expect from it? EMC offers a deduplication solution called Avamar, which is backup software and a backup-to-disk appliance all wrapped into one. The CAT Tool is essentially a modified Avamar client that performs a simulated backup on your server(s). Instead of actually backing the data up, it just tracks what the deduplication rate is and how long the actual backup would have taken. I’ll quickly take you through the process of running this tool and show you how easy it is to figure out exactly how much commonality is in your data.
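To make "commonality" concrete, here is a minimal sketch of the general idea: split data into chunks, hash each chunk, and compare how many chunks were seen with how many were unique. This is only an illustration and not how the CAT Tool or Avamar works internally. The fixed 4 KB chunk size, the SHA-256 hash, and the `commonality` function are all assumptions made for the example.

```python
import hashlib


def commonality(blobs, chunk_size=4096):
    """Return (total_chunks, unique_chunks) over an iterable of byte strings.

    Illustrative only: fixed-size chunks and SHA-256 are assumptions,
    not the method used by any particular product.
    """
    seen = set()
    total = 0
    for data in blobs:
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            total += 1
            seen.add(hashlib.sha256(chunk).digest())
    return total, len(seen)


# Two made-up "files" that share most of their content.
file_a = b"A" * 8192 + b"B" * 4096
file_b = b"A" * 8192 + b"C" * 4096

total, unique = commonality([file_a, file_b])
ratio = total / unique
print(f"{total} chunks, {unique} unique, ratio {ratio:.1f}:1")
```

The more chunks repeat across files, servers, and successive backups, the higher the ratio climbs, which is why the make-up of your data matters more than the technology itself.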

***One important thing to note before beginning is that the CAT Tool has the same impact on your system as a normal backup client. It is recommended to run it off-hours and not during your regular backup window.

  1. Download the CAT Tool from the IDS website (click for the deduplication rate test tool download). A link will be emailed to you where you can download a zip file containing the tool.
  2. Extract the zip file to C:\Avasst. The directory will contain avtar.exe and avasst.exe. *Note: This directory must exist or the tool will not run correctly.
  3. Open a command prompt by going to Start/Run and running cmd.exe.
  4. Browse to the CAT directory by typing “cd \Avasst”
  5. Run the CAT tool by typing “avasst”
  6. You will be prompted to select a folder to scan. In this example, I will scan the D Drive by entering “d:”. If you want to scan multiple folders at once, see the notes at the end.
  7. The first time you run the tool, you can expect it to take approximately 1 hour per 100GB of data and 1 hour per million files. However, subsequent runs will be much quicker due to deduplication.
  8. When the tool has completed running, you will see the following screen:
  9. Now if you look in the C:\Avasst folder, you will see several files that are tracking the deduplication rates and backup times for your data. They are just raw data and need to be run through a tool in order to interpret the results.
  10. In order to see the full benefits of deduplication, you will want to run this tool against the same dataset several times (at least 3). You can also run it against different datasets to see commonality across multiple servers. Since the commonality tracking is stored locally in the C:\Avasst folder, you will want to mount directories from other servers and scan them from this server across the network.
  11. When you have scanned your datasets, zip the results and send them to your IDS Engineer to have the results interpreted.
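To plan your off-hours window, the rule of thumb from step 7 can be sketched as simple arithmetic. The post does not say how the size-based and count-based figures combine, so this hypothetical helper, `estimate_first_run_hours`, just reports both and leaves the judgment to you.

```python
def estimate_first_run_hours(total_gb, file_count):
    """Return (hours_by_size, hours_by_count) for a first CAT run.

    Uses the rough guidance of ~1 hour per 100 GB and ~1 hour per
    million files. How the two combine is not specified, so both
    figures are returned separately.
    """
    hours_by_size = total_gb / 100.0
    hours_by_count = file_count / 1_000_000
    return hours_by_size, hours_by_count


# Example with made-up numbers: a 500 GB volume holding 2 million files.
by_size, by_count = estimate_first_run_hours(total_gb=500, file_count=2_000_000)
print(f"size-based: {by_size:.1f} h, count-based: {by_count:.1f} h")
```

Treat these as ballpark figures for the first run only, since later runs against the same data should finish much faster.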

Some other notes on the CAT Tool:

  • If you want to scan data from other servers, you can simply mount another server’s drive to a drive letter on the local server and scan that drive.
  • In a real deduplication solution, all data will be deduplicated globally against other servers and backup sets. Since the assessment tool tracks deduplication locally, you will need to scan all datasets from the same server to see global deduplication benefits.
  • The CAT Tool can easily be scheduled using the built-in Windows Scheduler. Instructions are in a Word document included with the CAT Tool download.
  • If you want to scan multiple folders at once, you will need to create a silent file that contains the folders you want to scan. Simply create a file named “Silent” with no extension in the C:\Avasst folder. Inside of that file, just put a line for each drive or folder you want to scan (see screenshot below)

**Note that you cannot end any entries with a backslash. For example, C: and D:\Users are valid, but C:\ and D:\Users\ are invalid.
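If you would rather generate the silent file than type it by hand, here is a minimal sketch that writes one entry per line and rejects entries ending in a backslash, per the note above. The helper names `build_silent_lines` and `write_silent_file` are hypothetical, and the plain-text encoding is an assumption, since the exact file format is only described informally here.

```python
from pathlib import Path


def build_silent_lines(folders):
    """Validate folder entries and return them as a list of lines."""
    lines = []
    for folder in folders:
        entry = folder.strip()
        if not entry:
            continue  # skip blank entries
        if entry.endswith("\\"):
            raise ValueError(f"entry must not end with a backslash: {entry!r}")
        lines.append(entry)
    return lines


def write_silent_file(folders, cat_dir):
    """Write a file named 'Silent' (no extension) into cat_dir."""
    target = Path(cat_dir) / "Silent"
    target.write_text("\n".join(build_silent_lines(folders)) + "\n")
    return target


# Example (run on the server where the tool was extracted):
# write_silent_file(["C:", "D:\\Users"], r"C:\Avasst")
```

Validating before writing means a stray trailing backslash is caught up front rather than after the scan has been kicked off.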