Disaster Recovery

Do You Learn From Data Breaches And Disasters Or Observe Them?

By | Backup, Disaster Recovery, Security | No Comments

How many articles or blog posts have you read that talked about the “lessons we learned” from 9/11, the Japanese earthquake/tsunami, the Joplin tornado, Hurricane Katrina, or <insert disastrous event here>? I see them all the time, and after reading a very interesting article in the Winter issue of the Disaster Recovery Journal (you may have to register to view the full article), I got to thinking about this concept.

What is the indication that we have learned something? The word learn has several definitions, but my favorite (thanks to is this:

to gain (a habit, mannerism, etc.) by experience, exposure to example, or the like; acquire …

If you learn something, you gain a new habit or mannerism; in other words, you change something.

What does it mean to observe? Again, from

to regard with attention, especially so as to see or learn something …

Just notice the difference. Learning means taking action; observing means watching so that you can learn. This really hits home with me and how I talk to my customers, because we talk A LOT about all of the lessons we have learned from various disasters. I don’t think it’s just me, either. Do a Google search on the phrase “lessons learned from hurricane Katrina” and you get 495,000 hits. Do a search on “lessons learned from Japanese tsunami” and you get 2.64 million hits. This gets talked about A LOT.

But how much are we really learning? After Katrina, how many of you proactively, objectively assessed or had someone assess your ability to maintain a revenue stream if a debilitating disaster struck your center of operations, whatever your business is? How many of you looked at what happened in Japan, or in Joplin, MO, and said: if that happened to us, we’d be able to sustain our business and we aren’t just fooling ourselves?

Let’s put this in a less dramatic and more regularly occurring context. How many of you saw the completely insane events surrounding the breach of HBGary and actually DID SOMETHING to change behavior, or build new habits to ensure you didn’t suffer a similar fate? Many of us observed the event and were aghast at its simplicity of execution and the thoroughness with which information was exposed, but how many people actually learned from the event and changed the way their security is addressed? Have you looked at the ten-year breach at Nortel, or the data breach at Symantec, and set in motion a course of events that will do everything possible to prevent similar issues in your own organization?

These problems are not going away. They are becoming more and more prevalent, and they are not solely the problem of global Fortune 500 companies. Any organization that does any type of business has data that could be useful for nefarious purposes in the wrong hands. It is our responsibility as stewards of the data to learn the lessons and take action to secure and protect our data as though it were our money, because it is.

Photo Credit: Cherice

Integrating EMC RecoverPoint Appliance With VMware Site Recovery Manager

By | Disaster Recovery, EMC, How To, Virtualization, VMware | No Comments

For my “from the field” post today, I’ll be writing about integrating EMC RecoverPoint Appliance (RPA) with VMware Site Recovery Manager (SRM). However, before we dive in, if you are not familiar with RPA technology, let me first give a high-level overview:

RPAs are IP-based replication appliances for block LUNs. They are zoned via FC with all available storage ports. RPAs leverage a “Replication Journal” to track changes within a LUN; once the LUNs have fully seeded between the two sites, only the changed deltas are sent over the WAN. This lets you keep your existing WAN link rather than spend more money on WAN expansion. Because the RPA tracks the changes to the LUNs, it can create a bookmark every 5-10 seconds, depending on the rate of change and bandwidth. This keeps your data up to date and within a 10-second recovery point objective (RPO). The RPA also allows you to restore or test your replicated data from any one of the bookmarks created.

Leveraging RPA with VMware LUNs greatly increases the availability of your data during maintenance or a disaster. Because RPAs replicate block LUNs, they will replicate LUNs that have datastores formatted on them.

At a high level, to fail over a datastore manually you would:

  1. Initiate a failover on the RPA.
  2. Add the LUNs into an existing storage group in the target site.
  3. Rescan your HBAs in vSphere.
  4. Once the LUNs are visible, you will notice a new datastore available.
  5. Open the datastore and add all the VMs into inventory.
  6. Once all the VMs are added, configure your networking and power up your machines.

Although this procedure may seem straightforward, every step is manual, so your RTO (Recovery Time Objective) will increase.

With the VMware Site Recovery Manager (SRM) integration plug-in, the failover procedure can be automated. With SRM you have the ability to build policies as to which vSwitch you want each VM to move to, as well as which VM you want to power up first. Once the policies are built and tested (yes, you can test failover), to fail over your virtual site you simply hit the failover button and watch the magic happen.
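To make the policy idea concrete, here is a minimal sketch in Python of what such a plan boils down to: each VM gets a power-on priority and a target vSwitch, and the plan is simply the VMs sorted by priority. This is not the SRM API; the `VmPolicy` class, the `build_power_on_plan` function, and the VM and vSwitch names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmPolicy:
    """Hypothetical per-VM failover policy (not an SRM object)."""
    name: str
    priority: int        # lower number powers on first
    target_vswitch: str  # network to attach to at the recovery site


def build_power_on_plan(policies):
    """Return (vm_name, vswitch) pairs in the order the VMs should start."""
    ordered = sorted(policies, key=lambda p: (p.priority, p.name))
    return [(p.name, p.target_vswitch) for p in ordered]


# Illustrative inventory: the database must be up before the app and web tiers.
policies = [
    VmPolicy("web01", priority=3, target_vswitch="vSwitch-DMZ"),
    VmPolicy("sql01", priority=1, target_vswitch="vSwitch-DB"),
    VmPolicy("app01", priority=2, target_vswitch="vSwitch-App"),
]

plan = build_power_on_plan(policies)
for step, (vm, vswitch) in enumerate(plan, start=1):
    print(f"{step}. attach {vm} to {vswitch}, then power on")
```

The point of writing the policy down as data is that the same plan can be run as a test and as a real failover, which is what makes the one-button approach repeatable.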

SRM will automate the entire failover process and bring your site online in a matter of a few seconds or minutes, depending on the size of your virtual site. If you are considering replicating your virtual environment, I’d advise considering how long you can afford to be down and how much data you can afford to lose. The use of RecoverPoint Appliance and Site Recovery Manager can help ensure that you achieve your disaster recovery goals.

To Snapshot Or Not To Snapshot? That Is The Question When Leveraging VNX Unified File Systems

By | Backup, Data Loss Prevention, Disaster Recovery, How To, Replication, Security, VMware | No Comments

For those of you who are leveraging VNX Unified File systems, were you aware that you have the ability to checkpoint your file systems?

If you don’t know what checkpoints are, a checkpoint is a point-in-time copy of your file system. The VNX gives you the ability to automate the checkpoint process. Checkpoints can run every hour, or at any designated interval, and can be kept for whatever length of time is necessary (assuming, of course, that enough space is available in the file system).

Checkpoints by default are read-only and are used to revert files, directories and/or the entire file system to a single point in time.  However, you can create writable checkpoints which allow you to snap an FS, export it, and test actual production data without affecting front-end production. 

VNX Checkpoint also leverages Microsoft VSS, allowing users to restore their files to previous points created by the VNX. With this integration you can let users restore their own files and avoid the usual calls from users who have accidentally corrupted or deleted them. Yet, there are some concerns as to how big snapshots can get. The VNX will dynamically grow the space used by checkpoints based on how long you need them and how many you take on a daily basis. Typically the most a snapshot will take is 20% of the file system size, and even that percentage depends on how much data you have and how frequently it changes.
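For planning purposes, the schedule and the 20% figure above turn into some quick arithmetic. The sketch below is only a back-of-the-envelope estimate in Python, not a VNX tool: the function names are made up for illustration, and the 20% ceiling is the rule of thumb from this post rather than a guaranteed limit.

```python
def checkpoints_retained(interval_hours, retention_hours):
    """Number of checkpoints on hand once the schedule reaches steady state."""
    if interval_hours <= 0:
        raise ValueError("interval_hours must be positive")
    return int(retention_hours // interval_hours)


def checkpoint_space_ceiling_tb(fs_size_tb, ceiling_fraction=0.20):
    """Rule-of-thumb upper bound on checkpoint space (assumed 20% of the FS)."""
    return fs_size_tb * ceiling_fraction


# Example: hourly checkpoints kept for one week on a 16 TB file system.
count = checkpoints_retained(interval_hours=1, retention_hours=7 * 24)
ceiling = checkpoint_space_ceiling_tb(16)
print(f"{count} checkpoints retained, plan for up to {ceiling:.1f} TB of checkpoint space")
```

Actual consumption depends on your change rate, so treat the result as a starting point and watch real usage after the schedule has run for a while.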

For file systems that are larger than 16TB, achieving a successful backup can be a difficult task. With NDMP (Network Data Management Protocol) integration you are able to back up the checkpoints and store just the changes instead of the entire file system.

Take note that replicating file systems to other VNX arrays will carry your checkpoints over, giving you an off-site copy of the checkpoints made of the production FS. Backups on larger file systems can become an extremely difficult and time-consuming job. By leveraging VNX Replicator and checkpoints, you gain the ability to manage the availability of your data from any point in time you choose.

Photo Credit: Irargerich

Top 3 Security Resolutions For 2012: Moving Forward From “The Year Of The Breach”

By | Backup, Data Loss Prevention, Disaster Recovery, How To, Security | No Comments
I always feel a sense of renewal with the turn of the calendar. Many people use this time to set new goals for the new year and take the opportunity to get re-grounded and move toward accomplishing their goals. As I reflected on the security landscape in 2011, aptly named “The Year of the Breach”, I thought it would be a perfect time to make some resolutions for 2012 that everyone with any data to protect could benefit from.


1. Focus More on Security and Not Just on Compliance

On a day-to-day basis I speak to a wide range of companies and often see organizations that are so concerned about checking the box for compliance that they lose sight of actually minimizing risk and protecting data. Regardless of the regulation in the long list of alphabet soup (SOX, GLBA, PCI, HIPAA), maintaining compliance is a daunting task.

As a security practitioner, limiting the business’s exposure has always been my key concern. How can I enable the business while also minimizing risk? With this mindset, compliance helps to ensure that I am doing my due diligence, and that all of my documentation is in order to prove it, so that our customers and stakeholders stay happy and protected.

2. Ready Yourself for Mobile Device Explosion

The iPad is a pretty cool device. I’m no Apple Fanboy by any stretch, but this tablet perfectly bridges the gap between my smart phone and my laptop. I am not the only one seeing these devices become more prevalent in the workforce. People are using them to take notes in meetings and give presentations, yet users are not driving the business to support these devices. Many organizations instead are simply allowing their employees to purchase their own devices and use them on corporate networks.

If employees can work remotely and be happier and more efficient with these devices, security admins can’t and shouldn’t stand in the way. We must focus on protecting these endpoints to ensure they don’t get infected with malware. We’ve also got to protect the data on these devices to ensure that corporate data isn’t misused or stolen when it is spread over so many different devices.

3. Play Offense, Not Defense

I’ve worked in IT Security for a long time, and unfortunately along the way I’ve seen and heard a lot of things that I wish I hadn’t. Yet, I can’t afford to have my head in the sand regarding security. I need to have my finger on the pulse of the organization and understand what’s happening in the business. It’s important that I also understand how data is being used and why. Once I do, I am able to put controls in place and be in a better position to recognize when something is abnormal. With the prevalence of botnets and other malware, it is taking organizations 4-16 weeks before they even realize they have been compromised. Once a compromise surfaces, they have to play catch-up in order to assess the damage, clean the infection and plug the holes that were found. Breaches can be stopped before they start, if the company and/or security admin are adamant about being on the offense.

These are my top three resolutions to focus on for 2012. What is your list? I invite you to list your security resolutions in the comment section below. I’d love to know what your organization is focused on!

Photo Credit: simplyla

Following “The Year of the Breach” IT Security Spending Is On The Rise

By | Backup, Data Loss Prevention, Disaster Recovery, RSA, Security, Virtualization | No Comments

In IT circles, the year 2011 is now known as “The Year of the Breach”. Major companies such as RSA, Sony, Epsilon, PBS, Citigroup, etc. have experienced serious high profile attacks. Which begs the question: if major players such as these huge multi-million dollar companies are being breached, what does that mean for my company? How can I take adequate precautions to ensure that I’m protecting my organization’s data?

If you’ve asked yourself these questions, you’re in good company. A recent study released by TheInfoPro states that:

37% of information security professionals are planning to increase their security spending in 2012.

In light of the recent security breaches, as well as the increased prevalence of mobile devices within the workplace, IT security is currently top of mind for many organizations. In fact, with most of the companies that IDS is working with, I’m also seeing executives taking more of an interest in IT security. CEOs and CIOs are gaining a better understanding of technology and what is necessary to improve the company’s security position in the future. This is a huge win for security practitioners and administrators because they are now able to get the top-level buy-in needed to make important investments in infrastructure. IT security is fast becoming part of the conversation when making business decisions.

I expect the IT infrastructure to continue to change rapidly as virtualization continues to grow and cloud-based infrastructures become more mature. We’re also dealing with an increasingly mobile workforce where employees are using their own laptops, smart phones and tablets instead of those issued by the company. Protection of these assets becomes even more important as compliance regulations become increasingly strict and true enforcement begins.

Some of the technologies that grew in 2011, and which I foresee growing further in 2012, include Data Loss Prevention, Application-aware Firewalls and Enterprise Governance, Risk and Compliance. Each of these technologies focuses on protecting sensitive information to ensure that authorized individuals are using this information responsibly. Moving forward into 2012, my security crystal ball tells me that everyone, from the top level down, will increase not only their security spend but, most importantly, their awareness of IT security and of just how much their organization’s data is worth protecting.

Photo Credit: Don Hankins

What Happens When You Poke A Large Bear (NetApp SnapMirror) And An Aggressive Wolf (EMC RecoverPoint)?

By | Backup, Clariion, Data Loss Prevention, Deduplication, Disaster Recovery, EMC, NetApp, Replication, Security, Storage | No Comments

This month I will take an objective look at two competitive data replication technologies – NetApp SnapMirror and EMC RecoverPoint. My intent is not to create a technology war, but I do realize that I am poking a rather large bear and an aggressive wolf with a sharp stick.

A quick review of both technologies:

NetApp SnapMirror:

  • NetApp’s controller based replication technology.
  • Leverages the snapshot technology that is fundamentally part of the WAFL file system.
  • Establishes a baseline image, copies it to a remote (or partner local) filer and then updates it incrementally in a semi-synchronous or asynchronous (scheduled) fashion.

EMC RecoverPoint:

  • EMC’s heterogeneous fabric layer journaled replication technology.
  • Leverages a splitter driver at the array controller, fabric switch, and/or host layer to split writes from a LUN or group of LUNs to a replication appliance cluster.
  • The split writes are written to a journal and then applied to the target volume(s) while preserving write order fidelity.

SnapMirror consistency is based on the volume or qtree being replicated. If the volume contains multiple qtrees or LUNs, those will be replicated in a consistent fashion. In order to get multiple volumes replicated in a consistent fashion, you will need to quiesce the applications or hosts accessing each of the volumes and then take snapshots of all the volumes and then SnapMirror those snapshots. An effective way to automate this process is leveraging SnapManager.

After the initial synchronization, SnapMirror targets are accessible as read-only. This provides an effective source volume for backups to disk (SnapVault) or tape. The targets are not read/write accessible, though, unless the SnapMirror relationship is broken or FlexClone is leveraged to make a read/write copy of the target. The granularity of replication and recovery is determined either by a schedule (standard SnapMirror) or by semi-synchronous continual replication.
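The baseline-then-incremental idea behind snapshot-based replication can be sketched in a few lines of Python. This is a conceptual toy, not NetApp code: a “volume” is just a dict of block number to contents, and the function names are hypothetical. It only illustrates why, after the baseline copy, each update needs to ship nothing more than the difference between two snapshots.

```python
def take_snapshot(volume):
    """Freeze a point-in-time copy of the volume (block number -> contents)."""
    return dict(volume)


def snapshot_delta(previous, current):
    """Blocks that changed or appeared, plus blocks that were removed."""
    changed = {
        block: data
        for block, data in current.items()
        if block not in previous or previous[block] != data
    }
    deleted = [block for block in previous if block not in current]
    return changed, deleted


def apply_delta(target, changed, deleted):
    """Bring the replica up to the newer snapshot."""
    target.update(changed)
    for block in deleted:
        target.pop(block, None)


# Baseline transfer: the whole snapshot is copied once.
source = {0: b"boot", 1: b"data-v1"}
baseline = take_snapshot(source)
target = dict(baseline)

# Production keeps changing; the next scheduled update ships only the delta.
source[1] = b"data-v2"
source[2] = b"new"
del source[0]
newer = take_snapshot(source)
changed, deleted = snapshot_delta(baseline, newer)
apply_delta(target, changed, deleted)
print(f"shipped {len(changed)} changed block(s), removed {len(deleted)}")
```

Because the replica only ever moves from one complete snapshot to the next, it is always consistent as of some snapshot, which is also why recovery granularity follows the update schedule.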

When failing over, the SnapMirror relationship is simply broken and the volume is brought online. This makes DR failover testing and even site-to-site migrations a fairly simple task. I’ve found that many people use this functionality as much for migration as data protection or Disaster Recovery. Failing back to a production site is simply a matter of off-lining the original source, reversing the replication, and then failing it back once complete.

In terms of interface, SnapMirror is traditionally managed through configuration files and the CLI. However, the latest version of OnCommand System Manager includes an intuitive, easy-to-use interface for setting up and managing SnapMirror connections and relationships.

RecoverPoint is like TiVo® for block storage. It continuously records incoming write changes to individual LUNs or groups of LUNs in a logical container aptly called a consistency group. The writes are tracked by a splitter driver that can exist on the source host, in the fabric switch or on a Clariion (VNX) or Symmetrix (VMAXe only today) array. The host splitter driver enables replication between non-EMC and EMC arrays (check ESM for the latest support notes).

The split write IO with RecoverPoint is sent to a cluster of appliances that package, compress and de-duplicate the data, then send it over a WAN IP link or local Fibre Channel link. The target RecoverPoint Appliance then writes the data to the journal. The journaled writes are applied to the target volume as time and system resources permit, and are retained as long as there is capacity in the journal volume, so that the LUN(s) in the consistency group can be rewound to any retained point in time.
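To see why a journal gives you “any point in time,” here is a deliberately simplified Python model. It is not how the RecoverPoint appliance is implemented internally: the `Journal` class, its methods, and the bookmark name are hypothetical. The model keeps every write in arrival order, and rebuilding an image is just replaying the writes up to the point you pick.

```python
from dataclasses import dataclass, field


@dataclass
class JournalEntry:
    seq: int     # arrival order; replaying in this order preserves write order
    block: int
    data: bytes


@dataclass
class Journal:
    """Toy write journal with named bookmarks (conceptual model only)."""
    entries: list = field(default_factory=list)
    bookmarks: dict = field(default_factory=dict)

    def record(self, block, data):
        seq = len(self.entries) + 1
        self.entries.append(JournalEntry(seq, block, data))
        return seq

    def bookmark(self, name):
        # A bookmark just remembers how far the journal had advanced.
        self.bookmarks[name] = len(self.entries)

    def image_at(self, baseline, upto_seq):
        """Replay journaled writes onto the baseline, stopping at upto_seq."""
        image = dict(baseline)
        for entry in self.entries:
            if entry.seq > upto_seq:
                break
            image[entry.block] = entry.data
        return image


baseline = {0: b"A0", 1: b"B0"}
journal = Journal()
journal.record(0, b"A1")
journal.bookmark("before-patch")
journal.record(1, b"B1")
journal.record(0, b"A2")

latest = journal.image_at(baseline, len(journal.entries))
rewound = journal.image_at(baseline, journal.bookmarks["before-patch"])
print("latest:", latest)
print("rewound:", rewound)
```

The trade-off shows up even in this toy: how far back you can rewind is bounded by how many entries the journal can hold, which is why journal volume capacity determines the retention window.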

In addition to remote replication, RecoverPoint can also replicate to local storage. This option is available as a standalone feature or in conjunction with remote replication.

RecoverPoint has a standalone Java application that can be used to manage all of the configuration and operational features. There is also integration for management of consistency groups by Microsoft Cluster Services and VMware Site Recovery Manager. For application-consistent “snapshots” (RecoverPoint calls them “bookmarks”), EMC Replication Manager or the KVSS command line utilities can be leveraged. Recently a “light” version of the management tool has been integrated into the Clariion/VNX Unisphere management suite.

So, sharpening up the stick … NetApp SnapMirror is a simple-to-use tool that leverages the strengths of the WAFL architecture to replicate NetApp volumes (file systems) and update them either continuously or on a scheduled basis using the built-in snapshot technology. Recent enhancements to System Manager have made it much simpler to use, but it is limited to NetApp controllers. It can replicate SAN volumes (iSCSI or FC LUNs) in NetApp environments, as they are essentially single files within a volume or qtree.

RecoverPoint is a block-based SAN replication tool that splits writes and can recover to any point in time that exists in the journal volume. It is not built into the array, but is a separate appliance that exists in the fabric and leverages array-, fabric- or host-based splitters. I would make the case that RecoverPoint is a much more sophisticated block-based replication tool that provides a finer level of recoverable granularity, at the expense of being more complicated.

 Photo Credit: madcowk

EMC Avamar Virtual Image Backup: File Level Restore Saves Customer from “Blue Screen” Death

By | Avamar, Backup, Disaster Recovery, EMC | No Comments

Recently, a customer of ours had a mission-critical Virtual Machine “blue screen.” Yikes! The good news was that their environment was leveraging Avamar Virtual Image backups. The bad news was that the VM had been in an unstable state for a while, and every time the VM was restored it continued to “blue screen.” In other words, the OS itself was corrupted: one of the many joys of IT life!

To lose my title of Debbie Downer, let me explain that their environment was also leveraging Avamar Virtual File Level Restore (FLR). I must say that in my experience restoring applications, the data is priority one.

This picture couldn’t have been more beautiful: they had a Windows Server 2008 (Win2k8) template with SQL loaded, and they simply extracted the database files from the VMDK backup using FLR and restored them to the new VM, with the data intact and up to date. Take that, tape! Never had to request or load tapes to restore anything 5 years later!

If you are not familiar with EMC Avamar FLR, it is basically the ability to extract single objects out of the virtual image backups. This is done with a proxy agent that lives within your virtual environment, mounts your backups and extracts any data that exists within the VM. That means one single backup of your VM and the ability to restore anything within the VMDK without having to load a new VM.

This feature can be used in many ways: one being the dramatic example I just gave, another being the ability to use the data files for testing in other VMs. Although this is just a single feature example of the many abilities of Avamar, its usage will greatly reduce your RPO and RTO.

In my experience, leveraging Avamar and virtual file restore will improve your virtual restore procedures and bring the peace of mind that your data is within arm’s reach at any time of the day. As I continue to post about Avamar features and capabilities from the field, I’ve developed this as my slogan for the series: Keep your backups simple … and down with tape!

Photo Credit: altemark

Disaster Recovery, Past vs. Present: New Tech Offers Lean-and-Mean DR Made Easier & Less Expensive

By | Disaster Recovery, Replication | No Comments

Not so long ago, disaster recovery colocation sites were a topic that everyone wanted to talk about … but nobody wanted to invest in. This was largely because the state of the technology left these sites sitting cold: it was simply the most expensive insurance policy one could think of. With dramatically dropping storage costs, falling costs of WAN connectivity, a wealth of effective and robust replication technologies, and (most importantly) the abstraction of the application from the hardware, we have a new game on our hands.


Ye Olde Way:

WAN pipe: BIG, fast, low latency and EXPENSIVE. Replication technologies were not tolerant of falling too far behind. The penalty for falling behind was often a complete resynchronization of both sites, sometimes even requiring outages.

Servers: Exact replicas on each side of the solution. There is nothing so thrilling to the CFO as buying servers so they can depreciate as they sit, idle, waiting for the day when the hot site failed. Hopefully everyone remembered to update BIOS code on both sites, otherwise the recovery process was going to be even more painful.

Applications: Remember to do everything twice! Any and every change had to be done again at the colocation site or the recovery would likely fail. By the way, every manual detail counts. Your documentation should rival NASA flight plans to ensure the right change happens in the right order. Aren’t “patch Tuesdays” fun now?

Storage: Expensive since consolidated storage was new and pricey. Host-based replication took up so many resources that it crippled production and often caused more issues with the server than was worth the risk. Array-based replication strategies at the time were young and filled with holes. Early arrays only spoke Fibre Channel, so we had the added pleasure of converting Fibre Channel to IP with pricey bridges.

Process: Every application was started differently and required a subject matter expert to ensure success. There was no shortage of finger pointing to see whose fault it was that the application was grumpy on restore. Testing these solutions required the entire team offsite for days and sometimes weeks to achieve only partial success. is looking like a better DR strategy for IT at this point…

The New Way:


WAN pipe: Not as big, fast, low latency and much less expensive. Replication technologies have come a long way providing in-band compression and, due to the much larger processors and faster disk subsystems, even host-based replication strategies are viable for all but the busiest apps. But again, with the falling cost of array-based replication, you may not need host-based replication.

Servers: Keep the CPUs in roughly the same family and you are good to go. I am flexible to choose different manufacturers and specifications to serve my workload best without being overly concerned about messing up my production or disaster recovery farm’s viability. Even better, I can be running test, dev, and even some production apps over at the secondary site while I am waiting for the big bad event. Even your CFO can get on board with that.

Applications: Thanks to server virtualization, I ship the entire image of the server (rather than just data) as replication continues; whatever I do on one side is automatically on the other side. Most importantly, the image is a perfect replica of production. Go one step further and create snapshots before you work on the application in production and disaster recovery, in case you need to fail over or fail back to roll back a misbehaving update.

Storage: Remarkably cost effective, especially in light of solid state drives and connectivity options. Remarkable speed, flexibility, and features are available from a wealth of manufacturers. Support is still key, but even the largest manufacturers are working harder than ever to win your business with huge bang-for-buck solutions. Even small shops can afford storage solutions that ran $1M+ only 5 years ago. Even these “entry level” solutions are doing more than anything on the market back then.

Process: We get to pick how automated we want to go. Solutions exist to detect outages and restart entire environments automatically. Network protocols exist to flatten networks between sites. “Entry level” replication solutions will automatically register clones of machines at offsite locations for simple click-to-start recovery. What does this all mean to the IT team? Test well and test often. Administrators can start machines in the remote sites without affecting production, test to their hearts’ content, hand off to the application owners for testing, and then reclaim the workspace in minutes, instead of days.

We play a new game today. Lots of great options out there are providing immense value. Do your homework and find the right mix for your infrastructure.