I’ve experienced weird errors and faults with random virtual machines in a 5-host HA/DRS cluster. During a vMotion, or after restarting the Management Agents on the ESXi hosts, a virtual machine becomes unmanageable. Tasks such as vMotion get stuck at the 82% mark, or vCenter reports an ‘invalid’ state because it no longer knows where the VM is actually running.

Symptoms

Let’s take an example I experienced this morning:

    • A virtual machine, ‘VM01’, was running on ESX04.
    • DRS invoked a vMotion of this virtual machine to another host, ESX02.
    • The vMotion task remained stuck at 82% for over two hours.
    • The VM itself still responded to ping and RDP connections.
    • When I connected the vSphere Client directly to each of the two hosts, neither inventory showed the virtual machine as registered and running.
    • Using ‘esxtop’, I discovered that the virtual machine was actually running on ESX02, so the vMotion did complete successfully.
    • hostd.log on the destination host stated:

[‘vm:/vmfs/volumes/4c713f9b-9b799538-a0f9-78e7d1f7b584/VM01/VM01.vmx’] connect
to /var/run/vmware/root_0/1299491418874549_26415352/testAutomation-fd:
File not found
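
To check whether other VMs on a host hit the same error, the log can be searched for the ‘testAutomation-fd’ string from the excerpt above. This is only a minimal sketch: the function name find_fd_errors is my own, and the location of hostd.log differs between ESXi versions, so the path is passed in as an argument rather than assumed.

```shell
#!/bin/sh
# find_fd_errors LOGFILE
# Hypothetical helper: print every line (with its line number) of a
# hostd.log that mentions the testAutomation-fd socket seen in the error.
# The string is taken from the log excerpt in this post; nothing else
# about the log format is assumed.
find_fd_errors() {
    logfile="$1"

    if [ ! -r "$logfile" ]; then
        echo "cannot read: $logfile" >&2
        return 2
    fi

    # grep returns 1 when nothing matches, which callers can test for.
    grep -n 'testAutomation-fd' "$logfile"
}

# Example (the path is an assumption, adjust it to your ESXi version):
#   find_fd_errors /var/log/vmware/hostd.log
```

An empty result only means the string was not logged in that file; it does not prove the VM is healthy.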

Resolution

Duco Jaspars gave me a helping hand and pointed me to VMware KB 1033591.

    • The Virtual Machine’s vmware.log stated:

vmx| /vmfs/volumes/4c713f9b-9b799538-a0f9-78e7d1f7b584/VM01/VM01.vmx: Setup
symlink /var/run/vmware/1a79f8eb60648d355323caa9ae6e4cae
-> var/run/vmware/root_0/1299491418874549_26415352

    • This VM did have an entry under /var/run/vmware/root_0, so I skipped to step 4 of the resolution:

Contents of /var/run/vmware/root_0/*
root_0/1299491418874549_26415352:
configFile -> /vmfs/volumes/4c713f9b-9b799538-a0f9-78e7d1f7b584/VM01/VM01.vmx

    • I recreated the symlink with:

ln -s /var/run/vmware/root_0/1299491418874549_26415352 \
    /var/run/vmware/1a79f8eb60648d355323caa9ae6e4cae

    • I restarted the Management Agents on the affected host. The stuck vMotion task failed at that moment.
    • After reconnecting the host in vCenter, the VM was fully operational again.
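
The symlink step can be wrapped in a small guard so that nothing is overwritten by accident. The sketch below is my own and not part of the KB article: the function name ensure_symlink is hypothetical, and it only assumes a POSIX-style shell with ln available. It creates the link only when the target directory exists and the link itself is missing.

```shell
#!/bin/sh
# ensure_symlink LINK TARGET
# Hypothetical helper: recreate LINK -> TARGET, but only if TARGET exists
# and nothing is present at LINK yet.
ensure_symlink() {
    link="$1"
    target="$2"

    # Without the per-VM runtime directory there is nothing to link to.
    if [ ! -d "$target" ]; then
        echo "target missing: $target" >&2
        return 1
    fi

    # Leave an existing link or file alone instead of replacing it.
    if [ -L "$link" ] || [ -e "$link" ]; then
        echo "already present: $link"
        return 0
    fi

    ln -s "$target" "$link" && echo "created: $link -> $target"
}

# Example with the paths from this post:
#   ensure_symlink /var/run/vmware/1a79f8eb60648d355323caa9ae6e4cae \
#       /var/run/vmware/root_0/1299491418874549_26415352
```

The Management Agents still have to be restarted afterwards, as in the steps above; the helper only covers the symlink itself.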

The weird thing about this situation is that it happens with random virtual machines and only every so often. Most vMotion tasks complete successfully, while some (let’s say 1 in 20) run into this problem.

Duco told me that he experienced this problem on HP hosts using B-spec SD-cards for the ESXi hypervisor, but VMware Technical Support stated that they’ve seen this with hosts using local (SAS-)disks as well.

Has anyone experienced this as well? Are you using HP ProLiant DL360/380 G7 hosts? I’m very curious whether you found a permanent solution for this.