Previously, I talked about how to set up ControlShift in the cloud. In this post, I’ll dive into the ControlShift cloud-native disaster recovery solution, and how it leverages VMware Cloud on AWS to do on-demand disaster recovery.

While this post is about recovery to VMware Cloud on AWS, Datrium has announced that it will offer Automatrix and ControlShift to Microsoft Azure in 2020, following Microsoft’s announcement to support running the VMware stack (including VMware vSphere, vSAN, NSX, and vCenter) in Azure with Azure VMware Solutions.

To recap, ControlShift is a cloud-native disaster recovery solution for virtual machine-based workloads that offers a consistent experience, regardless of the underlying infrastructure. ControlShift runs inside your own cloud tenant or as SaaS.

ControlShift makes disaster recovery simple and foolproof, leveraging cloud-native design paradigms to deliver low RPO and RTO while keeping costs down.

Remembering the old days…

Remember how DR was insanely expensive not that long ago? You had to do that whole second datacenter thing with a bunch of servers, storage, networking and software. And have it running 24/7, because you couldn’t spin up an entire datacenter in minutes. Even if you paid a service provider to host the environment for you to save a little on cost, your RTO and RPO requirements forced you to keep the environment running 24/7.

I’ve always looked at the second-datacenter-for-DR approach as an expensive and clunky way of insuring your company against cybercrime and datacenter failures. But there just weren’t many alternatives for monolithic (VM-based) workloads. I’ve designed DR datacenters that ran non-production workloads during normal operations, then switched over to run production after a disaster. Had a lot of fun creating and implementing those designs, too. While I liked the complexity of that in my job as an architect, I didn’t like it for the companies having to pay for and manage it.

But cloud changes the game

As with so many other things, the cloud makes consuming IT infrastructure on-demand and self-service as simple as that.

For DR, the cloud made setting up and managing a DR site simpler by taking the physical aspect out of the equation. Veeam did this back in the day using its partner network, and Backup-as-a-Service (BaaS) and DR-as-a-Service (DRaaS) were very successful services for managed service providers.

Spinning up an environment, even if an MSP does it for you, is complex from an infrastructure perspective and still takes time. So even in the cloud, you’d have to set up a permanent environment for DR, racking up the MSP or cloud bill.

ControlShift fixes that problem.

ControlShift: cold-start disaster recovery

ControlShift’s party trick is the ability to (cold-)start a DR environment in the cloud. First, ControlShift deploys itself into a cloud tenant (read How Datrium deploys ControlShift into the cloud if you want to know how). It can then take an on-prem or cloud-based backup and use it as the source for a DR orchestration plan that restores into the on-demand cloud environment.

Now, ControlShift supports different orchestration sources and destinations, but we’ll focus on the on-prem to VMware Cloud on AWS scenario.

There’s a lot going on in this screenshot, but let’s focus on the left half of it. During a test or failover run of a DR plan, ControlShift creates the VMware Cloud on AWS tenant, connects it to the account running ControlShift, and deploys a Datrium storage appliance in the VMC on AWS SDDC. This appliance is an NFS VM that makes the backups stored in the S3 bucket available to the SDDC, backed by a local NVMe cache. Datrium indicates the performance of this storage layer is comparable to a hybrid HDD/SSD storage array: cache hits are fast (sub-millisecond), while cache misses go out to S3 and are slower, in the range of 20 milliseconds. A single NFS VM will max out the physical connection between the SDDC and S3, so adding more won’t improve S3 performance, although Datrium is looking at scalability.
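As a back-of-the-envelope illustration (my own model, using only the hit and miss latencies quoted above, with an assumed 0.5 ms for "sub-millisecond"), the effective read latency of the NFS VM depends heavily on how often the NVMe cache is hit:

```python
# Illustrative model of the NFS VM's effective read latency.
# The figures come from the latencies quoted above: sub-millisecond
# (assumed 0.5 ms here) for NVMe cache hits, ~20 ms for S3 misses.

CACHE_HIT_MS = 0.5    # assumed NVMe cache hit latency
CACHE_MISS_MS = 20.0  # S3 round trip on a cache miss

def effective_latency_ms(hit_ratio: float) -> float:
    """Weighted-average read latency for a given cache hit ratio."""
    return hit_ratio * CACHE_HIT_MS + (1 - hit_ratio) * CACHE_MISS_MS

for ratio in (0.50, 0.90, 0.99):
    print(f"{ratio:.0%} hit ratio -> {effective_latency_ms(ratio):.2f} ms")
```

Even at a 90% hit ratio the S3 misses dominate the average, which is why Datrium’s hybrid-array comparison feels about right.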

While this solution does about 50k IOPS and about 500 megabytes per second of sustained throughput between S3 and the NFS VM, it’s not as fast as native vSAN on NVMe. It’s worth remembering that this Datrium array is an intermediate storage location: if the failover is made permanent, VMs are moved to native vSAN using Storage vMotion. For test failovers, migrating to native vSAN makes little sense, given the time it takes and the cost involved.
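As a quick sanity check on those two numbers, they imply an average I/O size:

```python
# Implied average I/O size from the quoted ~50k IOPS and
# ~500 MB/s of sustained S3-to-NFS-VM throughput.
iops = 50_000
throughput_bytes_per_s = 500 * 1_000_000  # 500 MB/s

avg_io_bytes = throughput_bytes_per_s / iops
print(f"Implied average I/O size: {avg_io_bytes / 1000:.0f} KB")  # 10 KB
```

A ~10 KB average I/O is in the right ballpark for mixed VM workloads, so the IOPS and throughput figures are at least internally consistent.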

Can ControlShift do traditional DR too?

One of the major downsides of the technologically very cool solution of cold-starting an SDDC is the relatively long time it takes to instantiate a VMware Cloud on AWS SDDC: up to 90 minutes. For some failover scenarios, this is not acceptable.

ControlShift also supports warm and hot scenarios, where either a barebones SDDC that autoscales in a DR scenario or a full SDDC is kept on standby, much like the DR sites of old. These work in much the same way, but they cost more money.

What about fail-back?

During the presentation at Cloud Field Day 5, I kept thinking two things: (1) how cool is it to be able to cold-start an entire SDDC for disaster recovery, and (2) how are they ever going to do a proper fail-back if the data ends up on native vSAN, a different storage platform entirely?

Luckily, they have another party trick up their sleeve. They can do a proper fail-back without rehydrating the entire dataset, syncing only the changes made since the last snapshot the on-prem environment sent over, the one on which the DR was based. Here’s how:

  1. During failover, they create a vSphere snapshot and immediately delete it, creating a VADP block map.
  2. They perform a permanent failover, Storage vMotioning the VM to native vSAN.
  3. When it’s time to go back home, they take a second vSphere snapshot of each VM and get the blocks that changed since the permanent failover.
  4. They copy those blocks, put them in a Datrium snapshot and replicate it to the on-prem environment.
  5. From there, it’s the usual fail-back sequence to get the VMs up and running at the primary location.
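To make the changed-block mechanics concrete, here’s a minimal simulation (my own sketch, not Datrium’s code) of steps 3 and 4: a disk is modeled as a dict of block-to-content, and only the blocks that differ from the failover-time snapshot get replicated back on-prem.

```python
# Illustrative sketch of changed-block fail-back (not Datrium's
# implementation). A disk is modeled as a dict of block_id -> content;
# the VADP-style block map records what changed after failover.

def changed_blocks(snapshot: dict, current: dict) -> dict:
    """Return only the blocks that differ from the failover snapshot."""
    return {blk: data for blk, data in current.items()
            if snapshot.get(blk) != data}

def fail_back(on_prem: dict, delta: dict) -> dict:
    """Apply the replicated delta to the on-prem copy of the disk."""
    merged = dict(on_prem)
    merged.update(delta)
    return merged

# State at permanent failover; on-prem still has this copy.
snapshot = {0: "boot", 1: "app-v1", 2: "data-a"}
# The VM after running in the cloud: one block changed, one added.
current = {0: "boot", 1: "app-v2", 2: "data-a", 3: "data-b"}

delta = changed_blocks(snapshot, current)
print(delta)  # only blocks 1 and 3 travel back, not the whole disk

restored = fail_back(snapshot, delta)
assert restored == current
```

The point of the exercise: the on-prem side already holds everything up to the last replicated snapshot, so only the delta ever crosses the wire.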

Consistent Experience

In this deep-dive, I’m showing you a lot of what’s going on behind the scenes. In reality, Datrium does all of this heavy lifting from a single, simple UI: the vSphere Client. You use that interface for all of your day-to-day VM and storage-related work, and backup and DR are a consistent part of it: instantiating a VMC on AWS SDDC happens in the same console where you manage your VMs and configure off-site backup and DR. The same goes for setting up ControlShift, Cloud DVX and the network integrations between the various cloud accounts: all of it is part of a couple of simple workflows in the Datrium Automatrix and ControlShift UI.

During DR, all workloads run through the NFS VM before moving on to vSAN. Prior to the final move from the NFS VM to vSAN, Datrium takes a snapshot (and immediately removes it) to record the VADP state. If you later do a planned fail-back, ControlShift can ask VMware which blocks have changed since the failover, and rehydrate only those back to on-prem.

Wrapping up

Datrium’s ControlShift is a well-thought-out product.

  1. ControlShift itself is a flexible, deploy-where-needed solution. It uses cloud-native paradigms for availability, cost and resiliency reasons.
  2. It takes full advantage of VMware Cloud on AWS’ ability to be deployed on-demand and self-service, creating a tangible cost advantage with cold and warm start features while removing the vast complexity of traditional secondary datacenters.
  3. ControlShift doesn’t need to convert between VM and virtual disk formats to do DR to the cloud, saving time and preventing additional complexity.
  4. And there’s some real magic in the fail-back, preventing full replication of the VMs when going back to the on-prem datacenter. Again, this saves time, cost and complexity.
  5. And it does all this with one consistent experience. There’s no plethora of interfaces and tools across VMware on-prem, VMware Cloud on AWS, the storage platform, backups and DR replication. It’s just one interface that simplifies all of the complexity I showed in this post.

All in all, I’m impressed by the quality of ControlShift. It ticks many, if not all, of the boxes I’d look at in a DR solution today. From a technology perspective, it’s built using modern-day, cloud-native paradigms, without sacrificing support for native VMware VMs in AWS.

If you want to see and hear the full story, I highly recommend you watch their presentation at Cloud Field Day 5, https://www.youtube.com/watch?v=cvVnikk7oMM.