Monday, March 18, 2013

Restarting those hung PVS targets running on vSphere


Pretty excited to post this script as it was actually the first PowerCLI script I had ever worked on.  To this day it is still set as a scheduled task and runs daily in my production environment and works like a champ.   I knew PowerCLI was extremely powerful at that time, but had never taken the time to look at the cmdlets and what they had to offer.  It opened a whole new world as to how I manage a large virtualized infrastructure, and has really saved me a ton of time.
 
Anyway, back to the script and why I started it.  We had recently deployed a fairly large XenApp farm via Provisioning Server, with vSphere 4 as the backend hypervisor.  We scaled out using VMs, and the solution overall seemed to work pretty well once we worked through some of the pain points of migrating from a large non-provisioned Citrix environment.  However, we noticed that every so often some of the guests would sort of lose their way back to the PVS boxes and would no longer accept new terminal connections.  Originally we would hard reset them, but then we realized we were kicking off active sessions (which wasn't good).

This sort of went unnoticed for a while, and soon we had several VMs in this state.  Terminal sessions would stay active, new sessions could not be established, and console access would be pretty much locked.  Since these VMs stayed in this locked state, they were not subject to the nightly/early morning reboots that are scheduled through XenApp.  This snowballed into having a subset of our provisioned VMs that weren’t usable without us really being aware of it.  Not a huge deal since we scaled out, but not something you want to continue.

One day I was working on one of the servers in this half-hung state and noticed that the VMware tools were shown as “Not Running” in the vCenter console.  Finally I had something to “key” on to identify these servers.  Then the wheels started spinning on how to resolve this.  I had recently attended a VMware User Group meeting promoting this tool called the vEcoShell.  I was amazed by the power it possessed.  After realizing it was a framework for running PowerCLI commands, I immediately dug in and started looking into how to resolve my issue.  I couldn't immediately restart servers once they reported their VMware tools as “Not Running” because it would kick legit users off.  I also didn’t want to keep accumulating servers that were not getting regularly restarted and had lost the ability to accept new users.

The solution I came up with was to run a scheduled task to restart these lost souls right after our scheduled early morning XenApp Farm restarts.  Here is the script:

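Connect-VIServer YourvCenter

# Find powered-on VMs in the PVS cluster whose VMware tools status contains "not"
$hungVMs = Get-Cluster -Name YourPVSXENAPPCLUSTER | Get-VM |
    Where-Object { $_.PowerState -eq "PoweredOn" } | % { Get-View $_.ID } |
    Where-Object { $_.Guest.ToolsStatus -like "*not*" }

# Hard reset each hung VM without prompting for confirmation
foreach ($hungVM in $hungVMs)
{
    Restart-VM -VM $hungVM.Name -Confirm:$false
}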



Shown above is a simple but effective script.  It simply scans the entire cluster for VMs that report as “Powered On” and whose VMware tools status is -like "*not*".  The reason I chose to find objects using a wildcard -like rather than an exact match is that there are two different VMware tools states I saw reported when the servers were hung.  “Not Running” and “Not Installed” are both valid states identifying servers that were hung and not responding to our scheduled restarts.  Using a -like will include both states, which works out great.
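To see why the wildcard catches both states, here is a small sketch that runs the same -like filter against stand-in objects instead of real VM views, so it needs no vCenter connection.  The VM names are made up, and the four status strings are assumed to mirror the values vSphere reports for tools status (in the real script the property is an enum, which -like compares as text).

```powershell
# Stand-in objects shaped like the views the script filters on.
# The names are hypothetical; the status strings are assumed vSphere values.
$views = @(
    [pscustomobject]@{ Name = 'XA01'; Guest = [pscustomobject]@{ ToolsStatus = 'toolsOk' } }
    [pscustomobject]@{ Name = 'XA02'; Guest = [pscustomobject]@{ ToolsStatus = 'toolsNotRunning' } }
    [pscustomobject]@{ Name = 'XA03'; Guest = [pscustomobject]@{ ToolsStatus = 'toolsNotInstalled' } }
    [pscustomobject]@{ Name = 'XA04'; Guest = [pscustomobject]@{ ToolsStatus = 'toolsOld' } }
)

# -like is case-insensitive, so "*not*" matches both "Not" states
$hung = @($views | Where-Object { $_.Guest.ToolsStatus -like '*not*' })

# An exact comparison would only ever catch one of the two states
$exact = @($views | Where-Object { $_.Guest.ToolsStatus -eq 'toolsNotRunning' })

$hung | ForEach-Object { $_.Name }
```

Here the -like filter picks up XA02 and XA03, while the exact comparison would only find XA02.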

After that it hard resets each VM, and since the task runs right after the scheduled restarts, it’s still within the reboot window.  The reason I can hard reset it…PVS brings it back to a pristine state FTW!  Please test this heavily before implementing.
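One way to test it safely is a dry run.  The sketch below is a variation on the script, not what runs in my scheduled task: it lists what the filter matched and then calls Restart-VM with -WhatIf, which reports what would be reset without actually doing it.  It assumes you are already connected with Connect-VIServer, and the cluster name is the same placeholder as above.

```powershell
# Assumes an existing Connect-VIServer session; the cluster name is a placeholder.
$hungVMs = Get-Cluster -Name YourPVSXENAPPCLUSTER | Get-VM |
    Where-Object { $_.PowerState -eq "PoweredOn" } | ForEach-Object { Get-View $_.ID } |
    Where-Object { $_.Guest.ToolsStatus -like "*not*" }

# List what was matched so it can be checked against the vCenter console by hand
$hungVMs | Select-Object Name, @{ Name = "ToolsStatus"; Expression = { $_.Guest.ToolsStatus } }

# -WhatIf describes the reset for each VM without performing it
foreach ($hungVM in $hungVMs)
{
    Restart-VM -VM $hungVM.Name -WhatIf
}
```

Once the list matches what you expect, swapping -WhatIf back to -Confirm:$false gives you the original behavior.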