Friday, April 19, 2013

Reducing your RTO by changing the defaults of the VMware tools service recovery when using SRM


VMware SRM is an excellent solution for recovering your data center in the event of a disaster.  Granted, you have to have a supported SAN replication technology in place before even looking at such a solution, but if you do and you are heavily virtualized using VMware as your hypervisor, Site Recovery Manager is great for recovering sites in either a planned failover or a disaster.

Anyone who has ever used VMware SRM knows that a lot of the recovery time (which drives your RTO) is spent waiting for things to complete.  Drives have to be reattached, rescans of datastores have to be completed, boot order has to be met, etc.  The logic VMware uses for their scripts is flawless and I have not encountered an issue with any of the recovery plans from SRM 5.x and up.

Having watched many test and full recovery plans run, I have seen a good amount of that time go to waiting for the VMware Tools service to start, both on the initial startup of each VM and again when it is restarted to change the guest IP address, all in the specified boot order.  There is an advanced option to disable waiting on the VMware Tools service to report as running, but I would advise against that, because just booting the VMs does not guarantee that dependent services from other machines in the recovery group are fully up and functional.  Verifying that the VMware Tools service is running before proceeding is much more helpful.

Also, a lot is going on during a recovery.  Depending on the size of your recovery group, you are essentially creating a boot storm during either a test or a full recovery plan.  Sometimes VMware Tools hangs and eventually restarts successfully, especially if it is kept up to date, which is critical for SRM.

Since we are waiting on the “VMware Tools” service to start before moving to the next phase of the plan, I thought I would examine the restart policy of the VMware Tools service.  Here is a screen shot of the default behavior of the service.
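The same recovery settings can also be read from the command line with the `qfailure` verb of sc.exe, which reports the reset period and the configured failure actions for a service.  The sketch below is only an illustration: the helper name `Get-ScQueryFailureArgs` and the server name `SRMGUEST01` are made up for this example, and `VMTools` is assumed to be the service name, matching the one used by the script in this post.

```powershell
# Hypothetical helper (not from the original post): builds the argument list
# for "sc.exe \\<computer> qfailure <service>".
function Get-ScQueryFailureArgs {
    param(
        [Parameter(Mandatory = $true)][string]$ComputerName,
        [string]$ServiceName = 'VMTools'   # assumed service name of VMware Tools
    )
    @("\\$ComputerName", 'qfailure', $ServiceName)
}

# SRMGUEST01 is a placeholder server name.
$queryArgs = Get-ScQueryFailureArgs -ComputerName 'SRMGUEST01'

# Run this from an admin workstation that can reach the guest.
# Use sc.exe, not sc: in Windows PowerShell, "sc" is an alias for Set-Content.
# sc.exe @queryArgs
```

Uncomment the last line to run the query against a real server.  Running it once before and once after changing the settings is an easy way to confirm the change took effect.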






My initial thought was that it’s not too bad.  At least it is set to attempt to start the service again.  But looking at the numbers, 5 minutes is a long time to wait for the OS to attempt to start the service again.  I could be running a recovery plan with upward of 25 or 50 virtual machines in it, and I do not want to wait in 5-minute intervals for that service to start again.  I want it done immediately.  Here is the solution I came up with, which I run on all VMs that I configure inside of SRM.

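# Connect to the vCenter Server at the recovery site.
Connect-VIServer YourRecoveryvCenter

# Every VM in the SRM_Servers folder is protected by SRM.
$vms = Get-Folder SRM_Servers | Get-VM | Select-Object Name
ForEach ($vm in $vms)
{
    # Build the remote target, e.g. \\SERVER01.
    # This assumes the VM name matches the guest's host name.
    $target = "\\" + $vm.Name
    # On failure, restart the service after 5000 ms;
    # reset the failure count after 86400 seconds (one day).
    cmd /c "sc.exe $target failure VMTools reset= 86400 actions= restart/5000"
}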

To build this, I first needed to identify the servers I protect with SRM and the ones that I don’t.  I’m not going to make this a system-wide change to all of my servers, just the boxes using SRM.  At my recovery site, I had a folder in vCenter that only included the SRM placeholder VMs.  This works perfectly, as I can key in on just the servers I want to change.  I can build a variable holding just the list of SRM servers and loop through the entire set.  Perfect!

Inside the loop, I am basically concatenating a string that sets up the command I want to run with sc.exe, along with the correct parameters for recovery.  My concatenation includes the double whacks “\\” and the name of the server I want to configure.  I want the service to be restarted almost immediately and not after 5 minutes.  The switches that do this are reset= 86400, which resets the failure count after 86400 seconds (one day), and actions= restart/5000, which restarts the service 5000 milliseconds (5 seconds) after a failure.  Each server inside that specific folder should have the same settings when I am finished.
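To make the units of those switches explicit, here is a small sketch that assembles the same sc.exe arguments from named parameters.  It is an illustration under stated assumptions rather than the script I actually ran: the helper name `Get-ScFailureArgs` and the server name `SRMGUEST01` are made up for this example.

```powershell
# Hypothetical helper (not from the original post): builds the argument list
# for "sc.exe \\<computer> failure <service> reset= <s> actions= restart/<ms>".
function Get-ScFailureArgs {
    param(
        [Parameter(Mandatory = $true)][string]$ComputerName,
        [string]$ServiceName = 'VMTools',
        [int]$ResetSeconds = 86400,    # failure count resets after one day
        [int]$RestartDelayMs = 5000    # delay before the restart, in milliseconds
    )
    # "reset=" and "actions=" are passed as their own tokens, each followed
    # by its value, which mirrors the "reset= 86400" form on the command line.
    @("\\$ComputerName", 'failure', $ServiceName,
      'reset=', "$ResetSeconds",
      'actions=', "restart/$RestartDelayMs")
}

# SRMGUEST01 is a placeholder server name.
$failureArgs = Get-ScFailureArgs -ComputerName 'SRMGUEST01'

# Show the command that would be run.
"sc.exe " + ($failureArgs -join ' ')

# To apply it for real, run from an admin workstation:
# sc.exe @failureArgs
```

Passing `-RestartDelayMs 0` instead would request a restart with no delay at all.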






















Look at that: lower RTO, faster recovery, everyone is happy.  The restart delay is now effectively zero minutes (5 seconds) as opposed to 5 minutes.  That is a potential 15 minutes of waiting saved per server.



