Friday, April 19, 2013

Reducing your RTO by changing the defaults of the VMware tools service recovery when using SRM


VMware SRM is an excellent solution for recovering your data center in the event of a disaster.  Granted, you have to have a supported SAN replication technology in place before even looking at such a solution, but if you do, and you are heavily virtualized with VMware as your hypervisor, Site Recovery Manager is great for recovering sites in either a planned failover or a disaster.

Anyone who has ever used VMware SRM knows that a lot of the recovery time (which counts against your RTO) is spent waiting for things to complete.  Drives have to be reattached, rescans of datastores have to be completed, boot order has to be met, etc.  The logic VMware uses for their scripts is flawless, and I have not encountered an issue with any of the recovery plans from SRM 5.x and up.

Watching many test and full recovery plans run, I noticed that a good amount of the time is spent waiting for the VMware Tools service to start, both on the initial startup of the VM and when the VM is restarted to change the guest IP address, all while honoring the specified boot order.  There is an advanced option to disable waiting on the VMware Tools service to report as running, but I would advise against using it, because just booting the VMs does not guarantee that dependent services on other machines in the recovery group are fully up and functional.  Verifying that the VMware Tools service is running before proceeding is much more helpful.

Also, a lot is going on during a recovery.  Depending on the size of your recovery group, you are essentially creating a boot storm during either a test or a full recovery plan.  Sometimes the VMware Tools service hangs and eventually restarts successfully, especially if the tools are kept up to date, which is critical for SRM.
Since we are waiting on the “VMware Tools” service to start before moving to the next phase of the plan, I thought I would examine the restart policy of the service.  Here is a screenshot of the default behavior of the service.






My initial thought was that it’s not too bad.  At least it is set to attempt to start the service again.  But looking at the numbers, 5 minutes is a long time to wait for the OS to attempt to start the service again.  I could be running a recovery plan with upward of 25 or 50 virtual machines in it, and I do not want to wait in 5-minute intervals for that service to start again.  I want it done immediately.  Here is the solution I came up with, which I run on all VMs that I configure inside of SRM.

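Connect-VIServer YourRecoveryvCenter

# Every VM in the SRM_Servers folder is protected by SRM
$vms = Get-Folder SRM_Servers | Get-VM | Select-Object Name
ForEach ($vm in $vms)
{
    # sc.exe takes the remote machine as \\ServerName
    $target = "\\" + $vm.Name
    # On failure, restart VMware Tools after 5000 ms; reset the failure count after 86400 s (one day)
    cmd /c "sc.exe $target failure VMtools reset= 86400 actions= restart/5000"
}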

Next I needed to identify the servers I protect with SRM and the ones that I don’t.  I’m not going to make this a system-wide change to all of my servers, just the boxes using SRM.  At my recovery site, I had a folder in vCenter that only included the placeholder VMs.  This works perfectly, as I can key in on just the servers I want to change.  I can build a variable holding just the list of SRM servers and loop through the entire set.  Perfect!

Inside the loop, I am basically concatenating a string that sets up the command I want to run with sc.exe, along with the correct parameters for recovery.  My concatenation includes the double whacks “\\” and the name of the server I want to configure.  I want the service restarted almost immediately (restart/5000 is a 5-second delay, since sc.exe takes the delay in milliseconds), not after 5 minutes.  I found the correct switches and placed them in the command.  Each server inside that specific folder should have the same settings when I am finished.
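To inspect the command before pointing it at real servers, the string building can be pulled into a small function.  This is only a sketch: Get-ScFailureArgs and the server name SRMVM01 are hypothetical names invented for illustration, and the defaults simply mirror the values used in the script above.

```powershell
# Hypothetical helper (not part of the original script): builds the argument
# list for "sc.exe failure" so it can be reviewed before anything is changed.
function Get-ScFailureArgs {
    param(
        [Parameter(Mandatory)] [string] $ComputerName,
        [string] $ServiceName    = 'VMTools',
        [int]    $ResetSeconds   = 86400,   # failure counter resets after one day
        [int]    $RestartDelayMs = 5000     # delay before the restart, in milliseconds
    )
    # sc.exe expects "option=" and its value as separate tokens.
    @("\\$ComputerName", 'failure', $ServiceName,
      'reset=', "$ResetSeconds",
      'actions=', "restart/$RestartDelayMs")
}

# SRMVM01 is a placeholder server name.
$scArgs = Get-ScFailureArgs -ComputerName 'SRMVM01'
$scArgs -join ' '

# To apply the setting and then read it back on a real server:
#   & sc.exe @scArgs
#   & sc.exe "\\SRMVM01" qfailure VMTools
```

Printing the joined arguments first is a cheap dry run, and the qfailure query afterwards shows what the service control manager actually stored.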






















Look at that: lower RTO, faster recovery, everyone is happy.  I now wait a few seconds for the restart as opposed to 5 minutes, which saves a potential 15 minutes of waiting per server.




Wednesday, April 3, 2013

Fixing PVS Local Cache disks provisioned on vSphere


If you have a large PVS environment, you may get stuck in a situation where you need to replace all of the local cache disks.  It may be that you need more space on a larger drive, or, as in my particular instance, someone partitioned the cache disk as GPT, which apparently does not play well with PVS.

Here is a thread on the Citrix support forums facing the same issue we experienced:


Basically, since the target cache device was partitioned as GPT, PVS creates the cache file on the PVS repository instead of using the much faster local cache disk.  To make matters worse, that GPT partition was cloned over a hundred times.  We had a real problem on our hands, as this was a good portion of our newly provisioned XenApp 6.5 farm running on vSphere 5.0.

The end users didn't really complain, but it was easy to tell from the boot times that something was not right.  The performance was comparable to a device running in private mode.
I no longer primarily support XenApp, but I was asked to come up with a solution with the least amount of downtime.  We also wanted to avoid re-provisioning the targets, because setting up and creating new servers in the PVS console seemed like a bit too much work.
In situations like these, I start digging through the PowerCLI cmdlets for a solution.  I thought the Copy-HardDisk cmdlet looked promising.  My plan was to copy a clean MBR-partitioned drive over the existing GPT drive while the server was powered off.
I had a powered-off server with a properly MBR-partitioned cache disk attached, so I used that as my source.  I had our XenApp administrator send me a list of all of the GPT-partitioned servers so I could begin working on them.  The solution is composed of two separate scripts: one to identify where each server's bad cache disk resides, and the other to replace it.
Here is a snippet of the information the XenApp administrator sent me.  This will feed the initial script that discovers the existing bad drive locations.  I called mine GPT.csv; it needs a Name column holding each VM's name.











Next is the script that will discover the bad drives and write exactly what needs to be replaced to a CSV:


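Connect-VIServer yourvCenter

# GPT.csv needs a Name column holding each affected VM's name
$badServers = Import-Csv C:\scripts\csv\gpt.csv
ForEach ($bad in $badServers)
{
    # Parent is the VM name; Filename is the datastore path to the vmdk
    $disks = Get-VM $bad.Name | Get-HardDisk | Select-Object Parent, Filename
    # -Append requires PowerShell 3.0 or later
    $disks | Export-Csv -Append C:\scripts\csv\baddrives.csv -NoTypeInformation
}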

Basically, as I normally do, I am just looping through the feeder CSV file, GPT.csv.  For each entry, I am exporting both the VM name and the full path to the bad drive into another CSV file, baddrives.csv.  This way I have all of the information I need to fix the issue.  The output for baddrives.csv should look something like this:






This way the script knows exactly which VM each drive belongs to and where the drive lives, so it can replace it.  Now that we have the mappings, we can replace the drives.  Once again, we are going to loop through the second CSV file we generated and power down each VM individually.  Make sure there are no active sessions on any of the PVS clients whose cache disks need to be replaced.  In my case, the XenApp administrator would offline each server the day before, so I had free rein to replace the drives.
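Because the replacement overwrites disks, it may be worth sanity-checking the mapping file before using it.  The sketch below is an assumption about how such a check could look, not part of the original workflow: the sample rows, VM names, and datastore name are made up and stand in for the real baddrives.csv, which would be loaded with Import-Csv instead.

```powershell
# Made-up sample rows standing in for baddrives.csv; real rows come from the
# discovery script and would be loaded with Import-Csv.
$sample = @'
"Parent","Filename"
"XA65-001","[DS01] XA65-001/XA65-001.vmdk"
"XA65-002",""
'@ | ConvertFrom-Csv

# Keep only rows that have a VM name and a path shaped like
# "[datastore] folder/file.vmdk".
$valid = @($sample | Where-Object {
    $_.Parent -and $_.Filename -match '^\[.+\]\s.+\.vmdk$'
})

$skipped = @($sample).Count - $valid.Count
"{0} row(s) OK, {1} skipped" -f $valid.Count, $skipped
```

Any skipped rows are worth investigating by hand before the replace loop runs, since an empty or malformed destination path is not something to hand to a forced copy.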

As it loops through the set, each VM is powered down cold.  We don’t have to use Shutdown-VMGuest because, since it’s PVS, we don’t really care that the VM is powered off ungracefully.  Next we grab the vmdk from the known-good powered-off VM called GOOD_DISK.  Then we copy it right over the top of the bad cache disk on each VM.  We had to use the -Force switch to force the replacement of the existing disk, because that’s exactly what we want.  We do not want duplicate disks attached to the VM, and the VM's vmx file is already mapped to this file name.  That’s the true power of the script: we do not have to rename the new disk, nor do we have to worry about attaching it, since the VM is already mapped to the existing name.

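Connect-VIServer YourvCenter
# Same file the discovery script wrote
$list = Import-Csv C:\scripts\csv\baddrives.csv
ForEach ($bad in $list)
{
    # Hard power-off; the PVS target does not need a graceful shutdown
    Stop-VM $bad.Parent -Confirm:$false
    # Overwrite the bad cache disk in place with the known-good MBR disk
    Get-VM GOOD_DISK | Get-HardDisk | Copy-HardDisk -DestinationPath $bad.FileName -Force
    Start-VM $bad.Parent
}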

Note: vCenter does throw an error when the -Force switch is used.  It can safely be ignored; the script continues to work like a champ.

 





Note 2:  Test HEAVILY