Friday, April 19, 2013

Reducing your RTO by changing the defaults of the VMware tools service recovery when using SRM


VMware SRM is an excellent solution for recovering your data center in the event of a disaster.  Granted, you have to have a supported SAN replication technology in place before even looking at such a solution, but if you do and you are heavily virtualized with VMware as your hypervisor, Site Recovery Manager is great for recovering sites in both a planned failover and a disaster.

Anyone who has ever used VMware SRM knows that a lot of the recovery time (or RTO) is spent waiting for things to complete.  Drives have to be reattached, rescans of datastores have to be completed, boot order has to be met, etc.  The logic VMware uses for their scripts is flawless and I have not encountered an issue with any of the recovery plans from SRM 5.x and up.

Watching many test and full recovery plans run in action, a good amount of time is spent waiting for the VMware Tools service to start, both on the initial startup of the VM and when it is restarted to change the guest IP address and honor the specified boot order.  There is an advanced option to disable waiting on the VMware Tools service to report as running, but I would advise against that, because just booting the VMs does not guarantee that dependent services from other machines in the recovery group are fully up and functional.  Verifying that the VMware Tools service is running before proceeding is much more helpful.

Also, a lot is going on during a recovery.  Depending on the size of your recovery group, you are essentially creating a boot storm during either a test or a full recovery plan.  Sometimes the VMware Tools service hangs, and eventually it will restart successfully, especially if the tools are kept up to date, which is critical for SRM.
Since we are waiting on the "VMware Tools" service to start before moving to the next phase of the plan, I thought I would examine the restart policy of the VMware Tools service.  Here is a screenshot of the default behavior of the service.






My initial thought was it's not too bad.  At least it is set to attempt to start the service again.  But looking at the numbers, 5 minutes is a long time to wait for the OS to attempt to start the service again.  I could be running a recovery plan with upwards of 25 or 50 virtual machines inside it, and I do not want to wait in 5-minute intervals for that service to start again.  I want it done immediately.  Here is the solution I came up with that I run on all VMs that I configure inside of SRM.

Connect-VIServer YourRecoveryvCenter

# Grab every VM sitting in the SRM folder at the recovery site
$var = Get-Folder SRM_Servers | Get-VM | Select Name
ForEach ($guy in $var)
{
 # Build the \\servername target and set the VMware Tools service to restart
 # 5 seconds after a failure, with the failure count resetting daily
 $var1 = "\\"
 $var2 = $var1 + $guy.Name
 cmd /c "sc.exe $var2 failure VMtools reset= 86400 actions= restart/5000"
}

Next I needed to identify the servers I protect with SRM and the ones that I don't.  I'm not going to make this a system-wide change to all of my servers, just the boxes using SRM.  At my recovery site, I had a folder in vCenter that only included the placeholder .vmx icons.  This works perfectly, as I can key in on just the servers I want to change.  I can build a variable of just the list of SRM servers and loop through the entire set.  Perfect!

Inside the loop, I am basically concatenating a string that sets up the command I want to run with sc.exe along with the correct parameters for recovery.  My concatenation includes the double whacks "\\" and the name of the server I want to configure.  I want the service restarted almost immediately and not after 5 minutes, so I placed the correct switches in the command: reset= 86400 resets the failure counter daily, and actions= restart/5000 restarts the service 5 seconds after a failure.  Each server inside that specific folder should receive the same settings when I am finished.
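
To spot-check that a server actually picked up the new settings, sc.exe can also query the failure configuration back out.  This is just a quick sketch reusing the same VMtools service name as the script above; the server name is a placeholder:

cmd /c "sc.exe \\YOURSERVER qfailure VMtools"

The output should show the 86400 second reset period and the restart action with its 5000 millisecond delay.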



Look at that: lower RTO, faster recovery, everyone is happy.  Essentially zero wait on the restart as opposed to the 5 minutes, and a potential 15 minutes of waiting per server saved.




Wednesday, April 3, 2013

Fixing PVS Local Cache disks provisioned on vSphere


If you have a large PVS environment, you may get stuck in a situation where you need to replace all of the local cache disks.  It may be an issue where you need more space and a larger drive, or, in my particular instance, someone partitioned the cache disk as GPT, which apparently does not play well with PVS.

Here is a thread on the Citrix support forums facing the same issue we experienced:


Basically, since the target cache device was partitioned as GPT, PVS creates the cache file on the PVS repository instead of using the much faster local cache disk.  To make matters worse, that GPT partition was cloned over a hundred times.  We had a real problem on our hands, as this was a good portion of our newly provisioned XenApp 6.5 farm running on vSphere 5.0.

The end users didn't really complain, but it was easily identifiable by the boot times that something was not right.  The performance was comparable to when the device is running in private mode. 
I no longer primarily support XenApp, but I was asked to come up with a solution with the least amount of downtime.  We also wanted to avoid re-provisioning the targets, because that seemed like a bit too much work for someone to set up and create new servers in the PVS console.
In situations like these, I start digging through the PowerCLI cmdlets for a solution.  The Copy-HardDisk cmdlet looked promising.  My plan was to copy a clean MBR-partitioned drive and replace the existing GPT drive while the server was powered off.
I had a powered-off server with a properly MBR-partitioned cache disk attached, so I used that as my source.  I had our XenApp administrator send me a list of all of the GPT-partitioned servers so I could begin working on that.  The solution is composed of two separate scripts: one to identify where each server's bad cache disk resides, and the other to replace it.
Here is a snippet of the information the XenApp administrator sent me.  This will feed the initial script that discovers the existing bad drive locations.  I called mine GPT.csv











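For reference, the only thing the discovery script reads out of this feeder file is a Name column, so GPT.csv can be as simple as a one-column list of the affected servers (these names are made up for illustration):

Name
XENAPP01
XENAPP02
XENAPP03
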
Next is the script that will discover, and then write to a CSV, exactly what needs to be replaced:

Connect-VIServer yourvCenter

# GPT.csv is the list of affected XenApp servers from the XenApp administrator
$var2 = Import-Csv C:\scripts\csv\gpt.csv
ForEach ($bad in $var2)
{
 # Record the VM name (Parent) and the full datastore path to its cache disk (Filename)
 $var = Get-VM $bad.Name | Get-HardDisk | Select Parent, FileName
 $var | Export-CSV -Append C:\scripts\csv\baddrives.csv -NoTypeInformation
}

Basically, as I normally do, I am just looping through the feeder csv file, GPT.csv.  For each entry, I am exporting both the name and the full path of the bad drive into another csv file, baddrives.csv.  This way I have all of the information I need to fix the issue.  The output for baddrives.csv should look something like this:






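If it helps to picture it, baddrives.csv is just the two properties selected above written out per server, with hypothetical VM and datastore names along these lines:

"Parent","Filename"
"XENAPP01","[PVS_LUN01] XENAPP01/XENAPP01_1.vmdk"
"XENAPP02","[PVS_LUN01] XENAPP02/XENAPP02_1.vmdk"
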
This way the script knows exactly what the name and the mapping of the drive is in order to replace it.  Now that we have the mappings of the drives, we can replace them.  Once again, we are going to loop through the second csv file we generated and power down each VM individually.  Make sure there are no active sessions on any of the PVS clients whose cache disks need to be replaced.  In my example, the XenApp administrator would offline each server the day before, so I had free rein to replace the drives.

As it loops through the set, each VM is powered down cold.  The reason we don't have to use Shutdown-VMGuest is that since it's PVS, we don't really care that the VM is powered off ungracefully.  Next we grab the vmdk from the known good powered-off VM called GOOD_DISK.  Then we perform a complete replace over the top of the VMs with the bad cache disk.  We use the –Force switch to force the replacement of the existing disk, because that's exactly what we want.  We do not want duplicate disks attached to the VM, and the VM's vmx file is already mapped to this name.  That's the true power of the script: we do not have to handle the rename of the new disk, nor do we have to worry about attaching the drive, since the VM is already mapped to the existing name.

Connect-VIServer YourvCenter
# baddrives.csv was generated by the discovery script above
$list = Import-Csv C:\scripts\csv\baddrives.csv
ForEach ($bad in $list)
{
 # Power the target off cold, overwrite its cache disk with the known good MBR vmdk, then power it back on
 Stop-VM $bad.Parent -Confirm:$false
 Get-VM GOOD_DISK | Get-HardDisk | Copy-HardDisk -DestinationPath $bad.FileName -Force
 Start-VM $bad.Parent
}

Note: vCenter does throw an error when using the –Force switch.  It can safely be ignored; the script continues to work like a champ.
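
If the red text bothers you, and assuming the error actually surfaces as a non-terminating PowerShell error rather than only a task alert inside vCenter, the common -ErrorAction parameter on the copy line will keep the console output clean.  Only consider this after confirming in testing that the copy still completes:

Get-VM GOOD_DISK | Get-HardDisk | Copy-HardDisk -DestinationPath $bad.FileName -Force -ErrorAction SilentlyContinue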

 





Note 2:  Test HEAVILY



Monday, March 18, 2013

Restarting those hung PVS targets running on vSphere


Pretty excited to post this script as it was actually the first PowerCLI script I had ever worked on.  To this day it is still set as a scheduled task and runs daily in my production environment and works like a champ.   I knew PowerCLI was extremely powerful at that time, but had never taken the time to look at the cmdlets and what they had to offer.  It opened a whole new world as to how I manage a large virtualized infrastructure, and has really saved me a ton of time.
 
Anyway, back to the script and why I started it.  We had just recently deployed a fairly large XenApp farm via Provisioning Server, with vSphere 4 as the backend hypervisor.  We scaled out using VMs, and the solution overall seemed to work pretty well once we worked through some of the pain points of migrating from a large non-provisioned Citrix environment.  However, we noticed that every so often some of the guests would sort of lose their way back to the PVS boxes and would no longer accept new terminal connections.  Originally we would hard reset them, but then we realized we were kicking off active sessions (which wasn't good).

This sort of went unnoticed for a while, and soon we had several VMs in this state.  Terminal sessions would stay active, new sessions could not be established, and console access would be pretty much locked.  Since these VMs stayed in this locked state, they were not subject to the nightly\early morning reboots that are scheduled through XenApp.  This grew into a subset of our provisioned VMs that weren't usable without us really being aware of it.  Not a huge deal since we scaled out, but not something you want to continue.
One day I was working on one of the servers in this half-hung state and noticed that the VMware tools were shown as "Not Running" in the vCenter console.  Finally I had something to "key" on to identify these servers.  Then the wheels started spinning on how to resolve this.  I had recently attended a VMware User Group meeting promoting a tool called the vEcoShell.  I was amazed by the power it possessed.  After realizing it was a framework for running PowerCLI commands, I immediately dug in and started looking into how to resolve my issue.  I couldn't immediately restart servers once they reported their VMware tools as "Not Running" because it would kick legit users off.  I also didn't want to continue to build up servers that were not getting regularly restarted and were losing the ability to accept new users.

The solution I came up with was to run a scheduled task to restart these lost souls right after our scheduled early morning XenApp Farm restarts.  Here is the script:

Connect-VIServer YourvCenter
# Find powered-on VMs in the PVS XenApp cluster whose VMware Tools report a "not" status
$hungVMs = Get-Cluster -Name YourPVSXENAPPCLUSTER | Get-VM |
 Where-Object {$_.PowerState -eq "PoweredOn"} | % {Get-View $_.ID} |
 Where {$_.Guest.ToolsStatus -like "*not*"}

ForEach ($hungVM in $hungVMs)
{
 # Hard reset is fine here; PVS brings the target back to a pristine state
 Restart-VM -VM $hungVM.Name -Confirm:$false
}



Shown above is a simple but effective script.  It simply scans the entire cluster for VMs that report as "PoweredOn" and whose tools status is -like "*not*".  The reason I chose to find objects using -like vs -match is that there are two different statuses I saw the VMware tools report when the servers were hung.  "Not Running" and "Not Installed" are both valid states identifying servers that were hung and not responding to our scheduled restarts.  Using a -like will include both states, which works out great.
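
If you want to see what the script would catch before turning the scheduled task loose, a quick dry run that just groups the cluster by tools status works well.  This is only a sketch, assuming the same cluster name used above:

# Count powered-on VMs per VMware Tools status; anything reporting a "not" status would be restarted
Get-Cluster -Name YourPVSXENAPPCLUSTER | Get-VM |
 Where-Object {$_.PowerState -eq "PoweredOn"} | % {Get-View $_.ID} |
 Group-Object {$_.Guest.ToolsStatus} | Select Name, Count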

After that it hard resets each hung VM, and it all still happens within the reboot window.  The reason I can hard reset it…PVS brings it back to a pristine state FTW!  Please test this heavily before implementing.



Tuesday, February 26, 2013

Getting those PowerPath VE ESXi hosts to check into the license server via PowerCLI

I have run PowerPath VE in my production ESXi environment and I am a huge fan of the product.  What I have seen in the past is that sometimes the hosts will lose their way to the licensing server and not technically have a license checked out.  I don't know if there is a risk of losing the benefits that PowerPath VE provides when a host is in this unlicensed state, so I wrote a simple batch script that would re-register the hosts.  This worked in the past, but oftentimes I would be negligent and not run it as frequently as I should.

Coupled with the fact that we began to use VMware's lockdown mode to fully secure our ESXi hosts, the batch file was no longer working.  Lockdown mode prevents anyone from authenticating directly to the ESXi host, therefore the built-in lockbox sitting on the PowerPath VE server was no longer able to directly register the hosts as licensed.

I could manually disable each host's lockdown mode and run the batch file again, but that's pretty tedious and I am sure I would forget to do it.  In times like these I turn to PowerCLI, write a dependable script, set it to run in task scheduler, and not worry about it again.

Since I do not have PowerPath VE installed on all of my hosts, I will just maintain a csv file with the host information as an input for the script.










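The script only keys on a Name column from that file, so PPhosts.csv can be as simple as a list of the PowerPath VE host names (made up here for illustration):

Name
esxihost01.yourdomain.com
esxihost02.yourdomain.com
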
The rest of the script is pretty basic.  The script is a PowerCLI\PowerShell wrapper script.  I am importing the PPhosts.csv file so I can run a set of actions against each individual server sequentially.  For each host listed, I will first disable lockdown mode.  I dug the .ExitLockdownMode() and .EnterLockdownMode() methods out of a VMware KB article.


Connect-VIServer 'Your vCenter'

# PPhosts.csv lists the ESXi hosts that have PowerPath VE installed
$var = Import-Csv C:\temp\PPhosts.csv

ForEach ($guy in $var)
{
 # Temporarily take the host out of lockdown mode so rpowermt can reach it directly
 (Get-VMHost $guy.Name -ErrorAction SilentlyContinue | Get-View).ExitLockdownMode()

 # Build and run the rpowermt registration command for this host
 $1 = "rpowermt host="
 $2 = " register"
 $3 = $1 + $guy.Name + $2
 cmd /c $3

 # Put the host back into lockdown mode
 (Get-VMHost $guy.Name -ErrorAction SilentlyContinue | Get-View).EnterLockdownMode()
}



I used –ErrorAction SilentlyContinue so the script will not halt.  There is a known bug with vCenter where the status of lockdown mode is listed incorrectly.  The "SilentlyContinue" option allows the script to continue looping through the remaining hosts.

Next I am doing a little concatenation to build the correct command for each host in the csv file.  The end result should be $3 = "rpowermt host=MYHOST1.host.com register".  Using cmd /c will execute the same command I previously ran in my batch file, but with PowerShell as the wrapper, it inserts each host name ($guy.Name) every time the loop executes.
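
The same command can also be built in one line with PowerShell's string expansion instead of three variables; functionally identical, just a matter of style:

# Build the rpowermt registration command with an expandable string
$3 = "rpowermt host=$($guy.Name) register"
cmd /c $3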

In the end, I am just putting the host back into lockdown mode as it was before.  The last step is to set the script to run as a scheduled task.  I do periodically test the script's functionality and launch it as a scheduled task to verify it is working correctly.