All Posts By Jon Austin

Tape Sucks: The Avamar 6.0 Release #tapesucksmoveon

By Jon Austin | Avamar, Backup, Data Loss Prevention, Deduplication, EMC, Replication, Storage, VMware

Since its release last week, there has been a lot of buzz around Avamar 6.0. I am going to take the liberty of reading between the lines and exploring some of the good (but gory) details.

The biggest news in this release is the DDBoost/Data Domain integration in the new Avamar client binaries. This allows an Avamar client to send a backup dataset stream to a Data Domain system instead of an Avamar node or grid. Datasets that are not “dedupe friendly” (too large for Avamar to handle, or with very high change rates) are typically retained for shorter periods of time; these can be targeted at a Data Domain array but still managed through the same policies and the same backup and recovery interface.

Client types supported in this release are limited to Exchange VSS, SQL, SharePoint, Oracle, and VMware image backups. Replication of data is Avamar to Avamar and Data Domain to Data Domain: there isn’t any mixing or cross-replication. Avamar coordinates the replication and replicates the metadata so that backups are manageable and recoverable from either side. From a licensing perspective, Avamar requires a capacity license for the Data Domain system at a significantly reduced cost per TB; DDBoost and replication licenses are also required on the Data Domain.

There is a major shift in hardware for Avamar 6.0: 

  1. The Gen4 Hardware platform was introduced with a significant increase in storage capacity.
  2. The largest nodes now support 7.8TB per node – enabling grids of up to 124TB.
  3. The new high-capacity nodes are based on Dell R510 hardware with twelve 2TB SATA drives.
  4. To speed up indexing, the new 7.8TB nodes also leverage an SSD drive for the hash tables.
  5. There are also 1.3TB, 2.6TB, and 3.9TB Gen4 nodes based on Dell R710 hardware.
  6. All nodes use RAID1 pairs; it seems the performance hit from RAID5 on the 3.3TB Gen3 nodes was too high.
  7. All Gen4 nodes now run SLES (SUSE Linux) for improved security.
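The capacity figures above are easy to sanity-check. A minimal sketch, assuming a 16-data-node maximum per grid (that node count is an assumption for illustration; consult EMC sizing guidance for exact limits):

```python
# Sketch: back-of-the-envelope Avamar grid capacity math.
# The 16-data-node grid maximum is an assumption used for illustration.
def grid_capacity_tb(per_node_tb, data_nodes=16):
    """Raw licensed capacity of a RAIN grid, in TB."""
    return per_node_tb * data_nodes

# The 7.8TB Gen4 node yields roughly the "up to 124TB" grid quoted above.
print(round(grid_capacity_tb(7.8), 1))  # 124.8
print(round(grid_capacity_tb(3.9), 1))  # 62.4
```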

There were several enhancements made for grid environments. Multi-node systems now leverage the ADS switches exclusively for a separate internal network that allows the grid nodes to communicate in the event of front-end network issues. There are both HA and Non-HA front end network configurations, depending on availability requirements. In terms of grid support, it appears that the non-RAIN 1X2 is no longer a supported configuration with Gen4 nodes. Also, spare nodes are now optional for Gen4 grids if you have Premium Support.

Avamar 6.0 is supported on Gen3 hardware, so existing customers can upgrade from 4.x and 5.x versions. Gen3 hardware will also remain available for upgrades to existing grids as the mixing of Gen3 and Gen4 systems in a grid is not supported. Gen3 systems will continue to run on Red Hat (RHEL 4).

Avamar 5.x introduced vStorage API integration for VMware ESX 4.0 and later versions. This functionality provides changed block tracking for backup operations, but not for restores. Avamar 6.0 now provides in-place “rollback” restores leveraging this same technology, which reduces restore times dramatically by restoring only the blocks that changed back into an existing VM. The other key VMware feature introduced in version 6.0 is proxy server pooling: previously, a proxy was assigned to a datastore, but now proxy servers can be pooled for load balancing in large environments.

There were several additional client enhancements on the Microsoft front, including Granular Level Recovery (GLR) support and multistreaming (1 to 6 concurrent streams) for Exchange and SharePoint clients.

All in all, the Avamar 6.0 release provides several key new features and scales significantly further than previous versions. With the addition of Data Domain as a target, tape-less backup is quickly approaching reality.

Photo Credit: altemark

New Feature! EMC Replication Manager 5.3.1: Create SnapView Replicas of RecoverPoint Targets with Linked Jobs (#fromthefield)

By Jon Austin | Backup, Replication

I recently ran across a new feature in EMC Replication Manager 5.3.1. Users now have the ability to create SnapView replicas of RecoverPoint CDP, CRR or CLR target volumes. This improves recoverability and resiliency significantly by reducing the time the target volume has to be in physical access mode for backups or remote side testing activities.

The impact of mounting a RecoverPoint replica has always been a challenge. You have these nice consistent copies of your data in a remote location, so wouldn’t it make sense to mount them to a backup server and back them up? Additionally, wouldn’t it be convenient to test that application patch prior to going live?

The downside of these activities is that in logged access or physical access mode, users have a limited amount of journal space for incoming replication and, by default, an even smaller amount for target-side writes. If either fills up, you lose journaling and have to re-sync, and/or lose access to the target-side volume.
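To see why this matters, here is a back-of-the-envelope sketch of how long a journal can absorb incoming replication while the target is in access mode (all figures are hypothetical examples, not RecoverPoint defaults):

```python
# Sketch: estimate how long a RecoverPoint journal can buffer incoming
# replication while the target volume is in logged/physical access mode.
def hours_until_journal_full(journal_free_gb, change_rate_gb_per_hr):
    """Hours of headroom before the journal fills and forces a re-sync."""
    if change_rate_gb_per_hr <= 0:
        return float("inf")
    return journal_free_gb / change_rate_gb_per_hr

# e.g. 200GB of journal headroom against a 25GB/hr production change rate
# leaves an 8-hour window for backups or remote-side testing:
print(hours_until_journal_full(200, 25))  # 8.0
```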

Prior to RM 5.3.1, the only option was to halt the replication, roll to physical access, and perform your activities. This left users without replication for the duration of the testing/backup. If you were to take a snapshot or mirror using SnapView, you could resume the replication much sooner, but then there are all kinds of moving pieces to keep track of. That kind of complexity is sure to breed mistakes…

RM 5.3.1 addresses this issue head on with linked jobs.

Under the covers, all of the previously described complexities are still going on. The sequence is orchestrated, though, so that nothing gets fat-fingered: the application-consistent bookmark is mounted and rolled to physical access, SnapView creates its replica, and that replica is mounted to a mount host.

There are two ways to configure SnapView replicas of RecoverPoint targets: two-phase jobs and copy jobs. With a two-phase job, RecoverPoint CDP or CRR is selected as the source and the SnapView snap or clone is selected as the replication technology; this method leverages the linked job mechanism. A copy job references an existing RecoverPoint CDP or CRR job as the source of the SnapView replica. I realize this seems redundant, but there are many situations where you may want to separate jobs that are called by separate schedules.

If you want more details on any of these technologies or have any comments, please drop me a line below.

Photo Credit: Jeff Howard

Enterprise Flash Drive Usage Considerations

By Jon Austin | EMC, Storage

EMC introduced Enterprise Flash Drives (EFDs) into their mid-range Clariion storage arrays last year with the promise of them being the ultimate “Tier 0” performance drives. They provided some initial guidance on configuration, but as we see with most technologies, these recommendations have shifted somewhat over the last 12 months. It is still true that EFDs can provide tremendous performance gains over Fibre Channel and SATA drives; however, it is important to understand the configuration parameters and workloads that benefit from flash drives.

The biggest improvement with EFDs is seen in random, read-intensive, small-block workloads. These workloads thrive on flash drives because there is no mechanical seek time and requests are serviced by multiple parallel threads within an individual device. It is not uncommon to see 5000-8000 IOPS per drive in these environments (when configured with 4k or 8k block sizes). When using a larger block size, though, these I/O rates drop dramatically, often to levels matched by Fibre Channel drives, due to the restrictions of the drive interface.
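A rough sketch of that interface-bandwidth argument (the 200 MB/s interface ceiling and 6000 IOPS per-drive figure are hypothetical examples, not EMC specifications):

```python
# Sketch: why large block sizes erode the EFD IOPS advantage.
# Achievable IOPS is capped both by the drive's small-block ceiling and
# by interface bandwidth divided by the block size.
def achievable_iops(small_block_iops, interface_mb_s, block_kb):
    bandwidth_cap = interface_mb_s * 1024 / block_kb  # IOPS at the interface limit
    return min(small_block_iops, bandwidth_cap)

print(achievable_iops(6000, 200, 8))    # 6000: 8K I/O is drive-limited
print(achievable_iops(6000, 200, 64))   # 3200.0: 64K I/O is interface-limited
```

At 64K blocks the same drive delivers roughly half its small-block rate, which is why a well-striped FC RAID group can keep up in large-block work.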

This behavior requires us to re-think recommendations provided by Microsoft and Oracle to format database volumes with large sector sizes (i.e. 64K). These recommendations are made because the database providers want to cache large blocks in memory to get around poor disk performance, so they read in large sectors in the hope they will have page-hits in memory. If the nature of the application is truly random, however, small disk block sizes on flash drives can provide better performance. Additionally, because of their random read-heavy nature, database drives on EFDs won’t benefit much from having array controller read or write caching enabled.

The next consideration is whether or not to put database transaction log volumes on EFDs. The conventional wisdom when EFDs were released was to put TLog LUNs on RAID10 FC drives, because logging activity is primarily sequential writes. The original guidance for EFDs was to turn off SP write caching in order to allow the write cache to be better leveraged for FLARE LUNs. What we have learned in the field, though, is that enabling write cache has a positive effect on write-intensive EFD LUNs (like transaction log volumes). This is because the Clariion cache strategy is to coalesce consecutive writes into complete stripes using the MR3 algorithm; these complete stripes are then flushed to the “disks” in the RAID group, and such large full-stripe writes work very well with the EFD buffered write mechanism.
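The coalescing idea can be illustrated with a toy model (a simplification for intuition only, not the actual MR3/FLARE algorithm):

```python
# Sketch: toy model of coalescing consecutive cached writes into
# full-stripe flushes, in the spirit of the MR3 behavior described above.
def coalesce_into_stripes(write_lbas, stripe_blocks=4):
    """Count full-stripe flushes vs. partial (read-modify-write) flushes."""
    full = partial = 0
    run = []
    for lba in sorted(set(write_lbas)) + [None]:  # None terminates the last run
        if run and (lba is None or lba != run[-1] + 1):
            full += len(run) // stripe_blocks          # complete stripes
            partial += 1 if len(run) % stripe_blocks else 0  # leftover blocks
            run = []
        if lba is not None:
            run.append(lba)
    return full, partial

# A sequential log-style burst coalesces cleanly into full stripes:
print(coalesce_into_stripes(range(0, 8)))      # (2, 0)
# Scattered random writes mostly force partial-stripe flushes:
print(coalesce_into_stripes([0, 10, 20, 30]))  # (0, 4)
```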

So where wouldn’t you use EFD drives? Flash drives will perform well under most circumstances; however, FC and sometimes even SATA drives can provide comparable performance in single-threaded, large-block sequential operations like backup and media streaming. Flash drives can hold their own in these types of operations, but the cost/IOP and cost/GB do not make sense.

With the release and propagation of FLARE 30 and FAST, EFD drives are going to become a critical component of storage tiering, both in Tier 0 and in FAST Cache. This should simplify decisions regarding what should live on EFD “spindles” and what should live on FC and SATA.

SAN Storage Performance from the Host Perspective: Shedding Light on Confusing Counters

By Jon Austin | Storage

Last month we looked at performance from a holistic high level view and the importance of assessing the stack of related components that make up performance. This month I hope to shed some light on some confusing performance counters.

We will look at Microsoft Windows hosts and the performance tools and counters that we need to look at to establish a baseline and gain an understanding of how to apply host symptoms to SAN storage devices. We will also look at a couple of best practice configuration recommendations to facilitate performance tuning.

For Microsoft Windows systems, the tool most people use to assess how a system is performing is perfmon. It is a Microsoft Management Console tool that provides real-time graphs and can log performance counters. The performance counters we will focus on are primarily the Physical Disk counters. It is important, however, not to ignore other key system performance factors such as Processor, Memory, and Network values, as these can mask or exacerbate disk performance problems. There are third-party tools that leverage WMI or SNMP interfaces, but we will look at those in future installments.

When looking at perfmon disk counters in a SAN environment, the first thing you should ignore is the disk utilization (% Disk Time) counter. While not clearly documented, it is calculated as 100 × Avg. Disk Queue Length. Instead, take 100 minus the % Idle Time counter to get a true indication of disk utilization: the closer % Idle Time gets to 0, the busier the disk. In the same category, but deceptive in a SAN environment, are the queue length counters. Since Windows has no idea how many drives are servicing a particular device, a high queue length may seem to indicate a problem where there is none. An effective rule of thumb is to allow a queue length of up to 2 × the number of spindles backing the LUN.
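The two adjustments above can be sketched as simple helpers (a sketch of the rules of thumb described here, not an official formula):

```python
# Sketch: SAN-friendly interpretations of perfmon disk counters.
def true_utilization(pct_idle_time):
    """Disk utilization derived from % Idle Time rather than % Disk Time."""
    return 100.0 - pct_idle_time

def queue_length_ok(avg_queue_length, spindles):
    """Rule of thumb: queue depth up to 2x the backing spindle count is fine."""
    return avg_queue_length <= 2 * spindles

print(true_utilization(15.0))  # 85.0 -> the disk is 85% busy
print(queue_length_ok(12, 8))  # True: 12 <= 16 for an 8-spindle LUN
```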

The next set of counters relates to response time, or how quickly a disk transaction is serviced. The perfmon counter to look at is Avg. Disk sec/Transfer (or sec/Read and sec/Write for the respective response times). Use a scale value of .001, or 1/1000, to graph values in milliseconds. While large I/O operations and spikes may take longer, this value should typically be under 10ms on properly running systems.

After looking at utilization and response times, the next important set of counters covers the read/write balance and the operation rates you need to size for. Perfmon provides the Physical Disk Reads/sec and Writes/sec counters. With these values you can start to establish the number of drives required to support the required I/O load.
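A minimal sizing sketch, assuming a RAID5 write penalty of 4 and roughly 180 IOPS per FC spindle (both common rules of thumb used here for illustration, not vendor figures):

```python
import math

# Sketch: turning Reads/sec and Writes/sec into a rough spindle count.
def drives_required(reads_per_sec, writes_per_sec,
                    write_penalty=4, iops_per_drive=180):
    """Back-end IOPS = reads + writes * RAID write penalty."""
    backend_iops = reads_per_sec + writes_per_sec * write_penalty
    return math.ceil(backend_iops / iops_per_drive)

# 1200 reads/sec and 300 writes/sec on RAID5 -> 2400 back-end IOPS:
print(drives_required(1200, 300))  # 14
```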

With these three categories of Windows performance counters you can start establishing a baseline and a comparison to the SAN performance counters further down the stack. As you look at these performance counters, make sure that your system is up to date with the latest HBA drivers, firmware, and OS patches/hotfixes as recommended by your storage vendor. This will ensure the best performance and support.

EMC Avamar Replication Guide: Sizing Your Bandwidth Properly for Offsite Backup

By Jon Austin | Avamar, Backup, EMC, Replication

One of the most powerful features of Avamar software is its ability to replicate backup sets to another Avamar grid, providing offsite backups without the hassle of cutting and handling tapes. Avamar’s deduplication functionality reduces not only the amount of storage required but also the bandwidth required to replicate what is functionally the equivalent of a full backup. While the reduction in bandwidth is dramatic, the changed bytes still have to be sent across a WAN link, and sizing that bandwidth properly is critical to the effective operation of the Avamar system.
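As a rough illustration of the sizing exercise (the daily change figure and the 25% protocol overhead pad are hypothetical assumptions; actual post-dedupe change rates should come from your grid’s own reporting):

```python
# Sketch: WAN bandwidth needed to replicate a day's unique (post-dedupe)
# bytes within a replication window.
def required_mbps(daily_new_gb, window_hours, overhead=1.25):
    """Megabits/sec needed; `overhead` pads for protocol and retransmits."""
    megabits = daily_new_gb * 1000 * 8  # GB -> megabits (decimal units)
    return megabits * overhead / (window_hours * 3600)

# 40GB of unique data per day over an 8-hour replication window:
print(round(required_mbps(40, 8), 1))  # 13.9
```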


EMC Avamar Review from the Field: Backup Made Better with Source-Based Deduplication

By Jon Austin | Avamar, Backup, EMC

Companies and organizations often face challenges with their backup windows, raising concerns about overall backup reliability. EMC Avamar is a powerful backup and recovery system that can provide your organization with the reliability of daily full backups in a fraction of the time required by a traditional tape- and disk-based backup solution.


Five Key Elements to Good Data Storage Documentation

By Jon Austin | Storage

Having good data storage documentation makes you and your team more effective and efficient. Most importantly, though, good storage documentation puts all the information you need at your fingertips in the event that something goes wrong (which, at some point, it will). What does good documentation look like, then? Data storage documentation—whether you create it yourself or your vendor provides it—should possess five key elements.

1. Ideally, your documentation should start with a descriptive overview of your environment. This is important for new team members, managers and consultants, allowing them to quickly familiarize themselves with your environment and what you are trying to accomplish with your infrastructure. From there we get into the meat of the documentation.

2. The document that you will probably reference the most is the connectivity map. This document, often an MS Visio diagram, should visually describe how each of the devices (servers, storage, switches, tape/virtual tape, etc.) in your storage environment is connected. These diagrams can be accurate down to the ports on the fibre channel switches and individual devices. It is often useful to include a chart on this diagram listing management interface IP addresses and device WWPNs. An additional diagram or chart, included here or alongside it, is a storage layout that provides a physical perspective on the disk groups. Maintaining the accuracy of these documents is a key factor in being able to plan, troubleshoot, and maintain your storage environment.

3. The configuration detail section should have all the nitty-gritty details for each storage, connectivity and host device in your SAN environment. Much of this can be gathered with manufacturer provided tools and then compiled; however, key elements to record are management interface information (IP, user(s), passwords), driver versions, firmware versions, and software versions. Other items to record include WWPN addresses and switch port connections.

4. Regardless of the size of your environment, you should maintain a change-log. At a minimum, record when a change was made, who made it, and what was done. This will prove invaluable when something breaks or something is "fixed." The change-log can provide critical insight during failure analysis or when troubleshooting performance problems. Procedural guides provide a quick way to refresh your memory or assist new team members with tasks that are not frequently performed. Whether it is configuring a new server on the SAN, setting up a new replication consistency group or adding drives to your NAS, documenting the steps and using that document as a checklist provides consistent, repeatable results.
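As a minimal illustration, the bare change-log record described above (when, who, what) can be captured with something as simple as an appended CSV row (the field choices here are an example, not a standard):

```python
import csv
import io
from datetime import datetime, timezone

# Sketch: append one change-log record (when, who, what) as a CSV row.
def log_change(stream, who, what, when=None):
    when = when or datetime.now(timezone.utc).isoformat(timespec="seconds")
    csv.writer(stream).writerow([when, who, what])

# Hypothetical example entry (names and details are illustrative only):
buf = io.StringIO()
log_change(buf, "jaustin", "Zoned new ESX host to fabric A",
           when="2011-05-10T14:02:00+00:00")
print(buf.getvalue().strip())
```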

5. The most critical part of your documentation is the support information. Having the manufacturer phone numbers, site IDs, and device serial numbers (from the configuration detail) at your fingertips will shave critical time off of problem resolution. It is also important to have your integrator’s contact information in this section, as they can serve as a liaison with the manufacturer to escalate cases when necessary.