Not so long ago, disaster recovery collocation sites were a topic that everyone wanted to talk about … but nobody wanted to invest in. This was largely because the state of the technology left these sites sitting cold; it was simply the most expensive insurance policy one could think of. With dramatically dropping storage costs, falling costs of WAN connectivity, a wealth of effective and robust replication technologies, and (most importantly) the abstraction of the application from the hardware, we have a new game on our hands.
[image align=”center” width=”400″ height=”267″]http://idsforward.wpengine.com/wp-content/uploads/2011/05/dr-now-and-then-newest.jpg[/image]
Ye Olde Way:
WAN pipe: BIG, fast, low latency and EXPENSIVE. Replication technologies were not tolerant of falling too far behind. The penalty for falling behind was often a complete resynchronization of both sites, sometimes even requiring outages.
Servers: Exact replicas on each side of the solution. There is nothing so thrilling to the CFO as buying servers so they can depreciate as they sit, idle, waiting for the day when the hot site fails. Hopefully everyone remembered to update BIOS code at both sites, otherwise the recovery process was going to be even more painful.
Applications: Remember to do everything twice! Any and every change had to be done again at the collocation site or the recovery would likely fail. By the way, every manual detail counts. Your documentation should rival NASA flight plans to ensure the right change happens in the right order. Aren’t “patch Tuesdays” fun now?
Storage: Expensive since consolidated storage was new and pricey. Host-based replication took up so many resources that it crippled production and often caused more issues with the server than was worth the risk. Array-based replication strategies at the time were young and filled with holes. Early arrays only spoke Fibre Channel, so we had the added pleasure of converting Fibre Channel to IP with pricey bridges.
Process: Every application was started differently and required a subject matter expert to ensure success. There was no shortage of finger pointing over whose fault it was that the application was grumpy on restore. Testing these solutions required the entire team offsite for days and sometimes weeks to achieve only partial success. Monster.com was looking like a better DR strategy for IT at that point…
The New Way:
WAN pipe: It no longer has to be as big, as fast, or as low latency, and it is much less expensive. Replication technologies have come a long way, providing in-band compression, and thanks to much larger processors and faster disk subsystems, even host-based replication strategies are viable for all but the busiest apps. But again, with the falling cost of array-based replication, you may not need host-based replication.
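To make the WAN sizing point concrete, here is a rough back-of-the-envelope sketch in Python. The function names, the example change rates, and the 2:1 compression ratio are all assumptions for illustration, not figures from any particular replication product; real links also lose throughput to protocol overhead and bursty change rates.

```python
def required_mbps(changed_gb_per_day: float,
                  compression_ratio: float,
                  window_hours: float = 24.0) -> float:
    """Average WAN throughput (megabits/s) needed to ship one day's
    changed data within the given window, after in-band compression."""
    if compression_ratio <= 0 or window_hours <= 0:
        raise ValueError("compression_ratio and window_hours must be positive")
    bits_on_wire = changed_gb_per_day * 8e9 / compression_ratio  # 1 GB = 8e9 bits
    return bits_on_wire / (window_hours * 3600) / 1e6


def backlog_gb(changed_gb_per_hour: float,
               link_mbps: float,
               compression_ratio: float,
               hours: float) -> float:
    """How far behind (in GB of uncompressed changes) the replica falls
    after `hours` if the link cannot keep up with the change rate."""
    # Uncompressed GB the link can drain per hour once compression is applied.
    drained_gb_per_hour = link_mbps * 1e6 * 3600 / 8e9 * compression_ratio
    return max(0.0, (changed_gb_per_hour - drained_gb_per_hour) * hours)


if __name__ == "__main__":
    # Hypothetical workload: 100 GB of changes per day, 2:1 compression.
    print(f"Needed: {required_mbps(100, 2.0):.2f} Mbit/s")
    # Hypothetical busy shift: 10 GB/hour of changes over a 10 Mbit/s link.
    print(f"Backlog after 8 h: {backlog_gb(10, 10, 2.0, 8):.1f} GB")
```

A replication tool that tolerates a backlog simply drains it overnight; under the old model, that same backlog could force a full resynchronization.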
Servers: Keep the CPUs in roughly the same family and you are good to go. I am free to choose different manufacturers and specifications to serve my workload best without being overly concerned about messing up my production or disaster recovery farm’s viability. Even better, I can be running test, dev, and even some production apps over at the secondary site while I am waiting for the big bad event. Even your CFO can get on board with that.
Applications: Thanks to server virtualization, I ship the entire image of the server (rather than just data) as replication continues; whatever I do on one side is automatically on the other side. Most importantly, the image is a perfect replica of production. Go one step further and create snapshots before you work on the application in production and disaster recovery, in case you need to fail over or fail back to roll back a misbehaving update.
Storage: Remarkably cost effective, especially in light of solid state drives and connectivity options. Remarkable speed, flexibility, and features are available from a wealth of manufacturers. Support is still key, but even the largest manufacturers are working harder than ever to win your business with huge bang-for-buck solutions. Even small shops can afford storage solutions that ran $1M+ only 5 years ago. Even these “entry level” solutions are doing more than anything on the market back then.
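The snapshot-before-you-patch habit can be sketched as a toy Python model. This is not any hypervisor's API: `VirtualMachine`, `apply_update`, and the health check are hypothetical stand-ins, and a real environment would call its virtualization platform's own snapshot and revert operations.

```python
import copy
from typing import Callable, Dict, List


class VirtualMachine:
    """Toy stand-in for a VM image; `state` represents everything on disk."""

    def __init__(self, name: str, state: Dict[str, object]) -> None:
        self.name = name
        self.state = dict(state)
        self._snapshots: Dict[str, Dict[str, object]] = {}

    def snapshot(self, label: str) -> None:
        # Keep an independent copy so later changes cannot alter the snapshot.
        self._snapshots[label] = copy.deepcopy(self.state)

    def revert(self, label: str) -> None:
        self.state = copy.deepcopy(self._snapshots[label])


def apply_update(vms: List[VirtualMachine],
                 update: Callable[[VirtualMachine], None],
                 healthy: Callable[[VirtualMachine], bool],
                 label: str = "pre-update") -> bool:
    """Snapshot every VM, apply the update, and roll all of them back
    if any VM fails its health check. Returns True if the update stuck."""
    for vm in vms:
        vm.snapshot(label)
    for vm in vms:
        update(vm)
    if all(healthy(vm) for vm in vms):
        return True
    for vm in vms:
        vm.revert(label)
    return False


if __name__ == "__main__":
    prod = VirtualMachine("prod", {"app_version": 1})
    dr = VirtualMachine("dr", {"app_version": 1})

    def bad_patch(vm: VirtualMachine) -> None:
        vm.state["app_version"] = 2  # pretend version 2 misbehaves

    ok = apply_update([prod, dr], bad_patch,
                      healthy=lambda vm: vm.state["app_version"] == 1)
    print(ok, prod.state, dr.state)
```

Taking the snapshot at both sites before the change is what keeps the rollback symmetrical: neither side is left running the misbehaving update.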
Process: We get to pick how automated we want to go. Solutions exist to detect outages and restart entire environments automatically. Network protocols exist to flatten networks between sites. “Entry level” replication solutions will automatically register clones of machines at offsite locations for simple click-to-start recovery. What does this all mean to the IT team? Test well and test often. Administrators can start machines in the remote sites without affecting production, test to their hearts’ content, hand off to the application owners for testing, and then reclaim the workspace in minutes, instead of days.
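As an illustration of the "detect outages and restart automatically" idea, here is a minimal Python sketch of the decision logic only. The `FailoverMonitor` class and its three-missed-checks threshold are assumptions made for this example; commercial solutions use their own detection rules and add safeguards such as quorum or a witness site.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Triggers failover after `max_missed` consecutive failed health checks."""
    max_missed: int = 3       # assumed threshold; tune to your tolerance
    missed: int = 0
    failed_over: bool = False

    def record(self, primary_healthy: bool) -> bool:
        """Feed in one health-check result.

        Returns True only on the check that triggers the failover."""
        if self.failed_over:
            return False          # already running at the recovery site
        if primary_healthy:
            self.missed = 0       # a single good check resets the counter
            return False
        self.missed += 1
        if self.missed >= self.max_missed:
            self.failed_over = True
            return True
        return False


if __name__ == "__main__":
    monitor = FailoverMonitor(max_missed=3)
    checks = [True, False, False, True, False, False, False]
    for i, result in enumerate(checks):
        if monitor.record(result):
            print(f"check {i}: starting recovery site")
```

Requiring several consecutive misses is a deliberate design choice: it trades a slightly slower reaction for fewer false alarms from a momentary network blip.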
We play a new game today. Lots of great options are out there, providing immense value. Do your homework and find the right mix for your infrastructure.