As is typical in the world of IT, problems have a way of sneaking up on you when you least expect them, then viciously attacking you with a billy club. Often this happens when you are asleep, on vacation, severely inebriated, or have already worked 40 hours straight with no sleep. In my case, Super Bowl Sunday at around 8:30pm was my time to get the stick. And get it I did.
For reasons too sad to warrant comment, and far too irritating to explain in a family forum like this, our ESX host servers all became disconnected from our SAN array. The root problem was something else on layer 2 and got resolved quickly, but the virtual world was not so quick to recover. In retrospect, the problem was not a bad one, but when you’ve been drinking and can’t see the obvious answer, you tend to dig the hole you’ve fallen into deeper rather than climb promptly out.
By way of background, we are currently running vSphere 4.0, with a few servers having 32GB of memory and 8 cores, and a few having 512GB of memory and 24 cores. All ESX hosts are SAN-booting using iSCSI initiators on a dedicated layer-2 network. We use Nexus 1000v soft switches and have our ESX hosts trunked using 802.1Q to our core (6506-E switches running VS-S720-10G supervisors). Everything is redundant (duplicate trunks to each core switch, using EtherChannel with MAC pinning). So there you have that, for what it’s worth. Now back to the crashed servers.
We rebooted all of the ESX host servers, and with the exception of some fsck complaining, they all came up quite nicely. The problem was that none of the virtual machines came up. Let me add that we have the domain controllers, DHCP, DNS, etc. on these hosts. Crap.
So the first thing I did in my addled state was to add DHCP scopes to the DHCP servers at another office across the country, and point the VLANs off “that-a-way” by changing the ip helper-address on each VLAN on the Core. That got DHCP and DNS back online. As you can probably guess by now, I was MacGyver-ing the situation nicely, but really didn’t need to. That’s one of the problems when you’re in the trenches: you tend to think in terms of right-now instead of root cause.
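For the curious, that change amounts to swapping the DHCP relay target on each VLAN interface at the core. A hypothetical IOS sketch (all addresses made up) of the kind of change involved:

```
! VLAN 10's SVI on the core: stop relaying DHCP to the (down) local
! server and point it at a surviving DHCP server at the remote office.
interface Vlan10
 no ip helper-address 10.1.10.5
 ip helper-address 172.20.10.5
```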
The next thing I did was to start bringing up the virtual machines one by one using the command line on the ESX hosts. Why? Because I had no domain authentication, and the vSphere Client uses domain authentication. Here is where someone in a live talk would be interrupting me to point out that the vSphere Client can always be logged into using the root user of the hosts, even when domain authentication is set up for all users. Yes, that is true, and it would have been handy to know at the time.
In order to bring up the virtual machines, I had to first find the proper name of each by issuing:
vmware-cmd -l
from the command line. This command lists the .vmx path of every registered virtual machine and can take a while to run, especially if you have a lot of VMs sitting around, so go get a cup of coffee.
Once I had that list I prioritized the machines I wanted up first, and issued the:
vmware-cmd //server-name.vmx start
command on each one. That should have been the end of the boot-up drama, but it wasn’t. As it turns out, a message popped up (and I don’t remember the exact phrasing) to the effect of “you need to interact with the virtual machine” before it would finish booting. So, now I issued the:
vmware-cmd //server-name.vmx answer
command and got something that looked about like this:
Virtual machine message 0:
msg.uuid.altered:This virtual machine may have been moved or copied.
In order to configure certain management and networking features VMware ESX needs to know which.
Did you move this virtual machine, or did you copy it?
If you don't know, answer "I copied it".
0. Cancel (Cancel)
1. I _moved it (I _moved it)
2. I _copied it (I _copied it) [default]
Well, I didn’t know so I selected the default option (I copied it) and went on my way. That is fine in almost every circumstance and got all of my servers booted up. It did not, however, entirely fix the problem. In fact, even though all of my servers were booted, none could talk or be reached on the network.
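In hindsight, the whole list/start/answer dance can be scripted rather than typed out per machine. A rough sketch, assuming the ESX 4.0-era service console and its `vmware-cmd` tool (it will quietly do nothing on a box that doesn't have it):

```shell
#!/bin/sh
# Sketch of the manual recovery above: list every registered VM via
# `vmware-cmd -l`, power each one on, then answer its pending question.
start_all() {
    vmware-cmd -l | while read -r vmx; do
        vmware-cmd "$vmx" start
        # 'answer' prompts interactively; the default is "I copied it".
        vmware-cmd "$vmx" answer
    done
}

# Only attempt it where the tool actually exists (i.e., on an ESX host).
if command -v vmware-cmd >/dev/null 2>&1; then
    start_all
fi
```

Note that `answer` is interactive per VM, so this still needs a human at the keyboard for each "moved or copied?" prompt.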
This is where a little familiarity with the Nexus 1000v soft switches comes in handy. Very briefly, the architecture is made up of two parts: the VSM or Virtual Supervisor Module and the VEM or Virtual Ethernet Module. The VSM corresponds roughly to the supervisor module in a physical chassis switch, and the VEMs are the line cards. The interesting bit to remember for our discussion is that the VSMs (at least two for redundancy) are also Virtual Machines.
Some of you may have guessed already what the problem turned out to be, and are probably chortling self-righteously to yourself right about now. For the rest of us, here’s what happened:
I figured out the log-in-using-root thing and got the vSphere Client back up and running (oh, not before having to restart a few services on the vCenter Server, which is not a virtual machine, by the way. I’m not totally crazy!). Once I got that far I could log in to the Nexus VSM and look at the DVS to see what was going on. All of my uplink ports (except for ones having to do with control, packet, vmkernel, etc.) were in an “UP Blocked” state.
The short-term fix (again, the MacGyver job) was to create a standard switch on each host and migrate all critical VMs to that switch. That didn’t, however, fix the problem permanently, and besides, we like the Nexus switches and wanted to use them. With that in mind, and after a day or two to normalize the old sleep patterns, I set up a call with VMware support. This actually took longer than I expected, since I had to wait for a call-back from a Nexus engineer, and they are apparently as rare as honest salespeople or unicorns. That said, I did get a call back and we proceeded to troubleshoot the problem.
One thing that surprised me was that it took the Nexus engineer a bit longer than I would have thought to find the problem, and even once he did, it took longer to get resolution because we had to get Cisco involved. The problem, as it turns out, was licensing.
When you license the Nexus 1000v, you receive a PAK (Product Authorization Key). Once the VSM is installed, you use that PAK, along with the host ID of the now-installed VSM, to request your license from Cisco. Cisco then sends you a license key file that you install from the command line of the VSM. This is all somewhat standard and not surprising. What was surprising was that we had to do this at all, considering we had been licensed at the highest level (Enterprise, superdy-duperty cool or something) for years.
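The round trip looks roughly like this from the VSM command line (the filename and host below are made up; the `show license` and `install license` commands are standard NX-OS):

```
! Grab the host ID that goes into the license request. On the 1000v it is
! derived from the VSM virtual machine, which is why copying the VM hurts.
show license host-id

! After Cisco sends the key file, copy it to bootflash and install it.
copy scp://admin@192.0.2.10/n1kv_license.lic bootflash:
install license bootflash:n1kv_license.lic

! Sanity check: are the licenses actually in use now?
show license usage
```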
What happened was that the copy vSphere made in order to get each virtual machine back up after our crash changed the host ID of the VSM virtual machine(s). Thus, the license keys were no longer valid, and all host uplink ports went into a blocked state. (I’ll spare you the obvious gripe I have with the Nexus not offering any kind of command-line message about our licensing being hosed.) This is where we had to get Cisco Licensing involved, as we had to send them the old license key files and the new host-ID information so that they could generate new keys. Considering I was on the phone with them for only 15 minutes, it was as pleasant an experience as I’ve ever had dealing with Cisco’s licensing department. At least that’s something.
After fixing the licensing, the ports unblocked and I went through the tedium of adding back adapters to the Nexus, moving servers, etc. At the end of the day, however, it is all back to normal and working. There are a lot of lessons learned here, and you’ll no doubt pull your own, but the one overriding thing to be on the lookout for is that, under certain circumstances, if your Nexus VSMs are part of a crash and come back up, look to licensing first before troubleshooting anything else. Oh, and try to schedule your major system crashes for a more convenient time… when you’re sober. Just saying.