VMware SRM is an excellent solution for recovering your data center in the event of a disaster. Granted, you have to have a supported SAN replication technology in place before even looking at such a solution, but if you do and you are heavily virtualized using VMware as your hypervisor, Site Recovery Manager is great for recovering sites in either a planned failover or a disaster.
Anyone who has ever used VMware SRM knows that a lot of the recovery time (which counts against your RTO) is spent waiting for things to complete. Drives have to be reattached, datastore rescans have to complete, boot order has to be honored, etc. The logic VMware uses for its scripts has been flawless in my experience, and I have not encountered an issue with any of the recovery plans from SRM 5.x and up.
Having watched many test and full recovery plans run, I have seen that a good amount of the time is spent waiting for the VMware Tools service to start, both on the initial startup of each VM in the specified boot order and again when the VM is restarted to change the guest IP address. There is an advanced option to disable waiting on the VMware Tools service to report as running, but I would advise against that, because just booting the VMs does not guarantee that dependent services on other machines in the recovery group are fully up and functional. Verifying that the VMware Tools service is running before proceeding is much more helpful.
Also, a lot is going on during a recovery. Depending on the size of your recovery group, you are essentially creating a boot storm during either a test or a full recovery plan. Sometimes VMware Tools hangs and eventually restarts successfully, especially if it is kept up to date, which is critical for SRM.
Since we are waiting on the “VMware tools” service to start before moving to the next phase of the plan, I thought I would examine the restart policy of that service.
Here is a screenshot of the service's default behavior.
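The settings shown on the service's Recovery tab can also be read from the command line with `sc.exe qfailure`. The following is a minimal sketch, not part of the original post: `SRV01` is a placeholder server name, and `VMTools` is assumed to be the service name, matching the script used later in this post.

```powershell
# Placeholder server name (assumption); replace with one of your protected servers.
$server = 'SRV01'

# sc.exe takes the remote target as \\name, then the verb, then the service name.
$queryArgs = @("\\$server", 'qfailure', 'VMTools')

# Show the command that will be run.
"sc.exe $($queryArgs -join ' ')"

# Run it only where sc.exe is available (i.e. on Windows).
if (Get-Command sc.exe -ErrorAction SilentlyContinue) {
    & sc.exe @queryArgs
}
```

The output lists the reset period and the configured failure actions with their delays, which makes it easy to compare a server before and after any change.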
My initial thought was that it’s not too bad. At least it is set to attempt to start the service again. But looking at the numbers, 5 minutes is a long time to wait for the OS to attempt to start the service again. I could be running a recovery plan with upward of 25 or 50 virtual machines inside, and I do not want to wait in 5-minute intervals for that service to start again. I want it done immediately. Here is the solution I came up with, which I run on all VMs that I configure inside of SRM.
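# Connect to the vCenter server at the recovery site
Connect-VIServer YourRecoveryvCenter

# Collect the names of the VMs in the SRM folder
$var = Get-Folder SRM_Servers | Get-VM | Select Name

ForEach ($guy in $var) {
    # Build the remote target, e.g. \\ServerName
    $var1 = "\\"
    $var2 = $var1 + $guy.Name
    # Restart the service 5000 ms after a failure; reset the failure count after 86400 seconds
    cmd /c "sc.exe $var2 failure VMtools reset= 86400 actions= restart/5000"
}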
Next I needed to identify the servers I protect with SRM and the ones that I don’t. I’m not going to make this a system-wide change to all of my servers, just the boxes using SRM. At my recovery site, I had a folder in vCenter that included only the placeholder .vmx icons. This works perfectly, as I can key in on just the servers I want to change. I can build a variable holding just the list of SRM servers and loop through the entire set. Perfect!
Inside the loop, I am basically concatenating a string that sets up the command I want to run with sc.exe, along with the correct parameters for recovery. My concatenation includes the double whacks “\\” and the name of the server I want to configure. I want the service to be restarted right away and not after 5 minutes. I found the correct switches and placed them in the command: reset= 86400 clears the failure count after one day without failures, and actions= restart/5000 restarts the service 5,000 milliseconds after a failure. Each server inside that specific folder should receive the same settings when I am finished.
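Before touching real servers, it can help to preview exactly what will be sent to each one. The sketch below is an illustration rather than the script from this post: `New-FailureArgs` is a hypothetical helper, and `SRV01` and `SRV02` are placeholder names. It builds the same `sc.exe failure` arguments as an array and prints them instead of executing them.

```powershell
# Hypothetical helper (not from the original script): returns the sc.exe
# arguments that set the failure actions for one server.
function New-FailureArgs {
    param(
        [Parameter(Mandatory)] [string]$ComputerName,
        [string]$ServiceName = 'VMTools',
        [int]$ResetSeconds = 86400,  # reset the failure count after one day
        [int]$DelayMs = 5000         # wait 5000 ms (5 seconds) before restarting
    )
    # sc.exe expects each option name (ending in '=') and its value as separate tokens.
    @("\\$ComputerName", 'failure', $ServiceName,
      'reset=', "$ResetSeconds", 'actions=', "restart/$DelayMs")
}

# Dry run with placeholder names: print each command instead of running it.
foreach ($name in 'SRV01', 'SRV02') {
    $scArgs = New-FailureArgs -ComputerName $name
    "sc.exe $($scArgs -join ' ')"
}
```

Once the printed commands look right, the print line could be swapped for `& sc.exe @scArgs` to apply the change. Passing the arguments as an array avoids the extra `cmd /c` layer and keeps the option names and their values as separate tokens.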
Look at that: lower RTO, faster recovery, everyone is happy. I now wait about 5 seconds for each restart as opposed to 5 minutes. That is a potential 15 minutes of waiting saved per server.