Update on missing symlinks (KB1033591)

I just received an update from my assigned HP support engineer:

I got some info from VMware, the latest on the Problem Report is that engineering are currently testing a fix for the issue. Once complete there will be a patch available for the issue. At the moment VMware cannot give a time line of when the patch will be available. From looking at the SR attached to the bug the issue only impacts ESXi images. Both installable and embedded had encountered the issue.
I’m handling another case where the issue is reported with the vanilla version. So till the time the patch is released we need to use the workaround specified.

I can confirm that this bug is present on all variants of ESXi (vanilla, vendor customized images, installable and embedded versions) and on different storage platforms (local SAS RAID, SD-card). I don’t know exactly which releases of ESXi are affected, but I believe that versions 4.1 and 4.1 Update 1 are.

Disabling both sfcbd-watchdog and sfcbd on the ESXi-host seems to be the only work-around available, and is confirmed to eliminate the problem of the disappearing symbolic links.

 

More on VMware KB 1033591

For the last couple of days, I’ve been working on a work-around or solution for the missing symlinks that make virtual machines unmanageable in the vSphere Inventory. In KB1033591, VMware has described a work-around to re-create the symlink when it has disappeared.

HP (and VMware) support have informed me of a way to prevent the symlink from disappearing. I immediately posted this on Twitter. @vConsult, @gklooste, @Scorpinus and I suspect we are experiencing the disappearing symlink only on hosts installed with the HP customized install CD, although neither VMware nor HP has verified this to be the cause of the bug. I’m not sure whether the bug still manifests itself after reinstalling the hosts with a vanilla ESXi CD or updating them to 4.1 Update 1. I will try to verify whether hosts installed using the vanilla ESXi installation CD experience this bug as well.

Anyways, I wanted to share the instructions on how to prevent the symlinks from disappearing:

Engineering are aware and are currently investigating this issue. They have suggested the following work around to stop the issue occurring:
To workaround this issue:
Stop SFCBD on the ESXi host with the command:

/etc/init.d/sfcbd-watchdog stop

To make this change persistent on reboot, run these commands:

chkconfig sfcbd-watchdog off
chkconfig sfcbd off

Note: When you disable SFCBD, you cannot view the hardware status on the ESXi host.
If hardware monitoring is an environmental requirement, you can extend the amount of time before the issue recurs:
Note: Depending on the system workload, this change may temporarily resolve the issue.
From the ESXi Shell, edit the /etc/sfcb/sfcb.cfg file using a text editor.
Search for the entry provProcs: 16 and change the value from 16 to 12.
Run this command to restart sfcbd for the changes to take effect:

/etc/init.d/sfcbd-watchdog restart
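
The provProcs change can also be scripted instead of made in a text editor. What follows is only a minimal sketch: the function name set_prov_procs is made up for illustration, and it assumes the entry sits at the start of a line in the form "provProcs: <number>". It keeps a backup copy next to the original file.

```shell
#!/bin/sh
# Hypothetical helper (not part of ESXi): set provProcs in an sfcb config file.
# Usage: set_prov_procs FILE VALUE
set_prov_procs() {
    cfg="$1"
    value="$2"
    # Refuse to touch a file that has no provProcs entry at all.
    grep -q '^provProcs:' "$cfg" || return 1
    # Keep a backup copy next to the original.
    cp "$cfg" "$cfg.bak" || return 1
    # Rewrite the file from the backup, replacing only the provProcs line.
    sed "s/^provProcs:.*/provProcs: $value/" "$cfg.bak" > "$cfg"
}

# On the host this would be called as (not executed here):
#   set_prov_procs /etc/sfcb/sfcb.cfg 12 && /etc/init.d/sfcbd-watchdog restart
```

Whether the edited file survives a reboot is not something I have verified, so check the value again after restarting the host.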

Update: In the comments section below, Michael has pointed out that VMware KB 1035564 contains related information on disabling the sfcbd-watchdog:

“This issue occurs due to exhaustion of VMkernel resources. It occurs more frequently on hosts where OEM CIM providers have been installed, which may be caused by the added load of having CIM Providers under sfcbd.”

Update 2: After evaluating all nine ESXi-hosts in this environment, it seems that only the hosts installed with the HP customized version are suffering from this bug. Checking whether the advanced setting ‘UserVars.CIMoemProviderEnabled’ is present on a host is a good way to determine if the host was installed with an HP customized version (value is present) or a vanilla version (value isn’t present).
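
This check can be run from the ESXi shell as well. The sketch below assumes that esxcfg-advcfg -g exits with a non-zero status when asked for a setting that does not exist; I have not confirmed that on every build, so treat the result as a hint. The function name image_type and the ADVCFG variable are made up for illustration.

```shell
#!/bin/sh
# ADVCFG can be overridden so the function can be tried off-host.
ADVCFG="${ADVCFG:-esxcfg-advcfg}"

# Hypothetical helper: prints "customized" if the advanced setting can be
# read, "vanilla" otherwise.
# Assumption: the tool returns non-zero for a setting that is not present.
image_type() {
    if $ADVCFG -g /UserVars/CIMoemProviderEnabled >/dev/null 2>&1; then
        echo "customized"
    else
        echo "vanilla"
    fi
}

# On the host (not executed here):
#   image_type
```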

VM becomes unmanageable after vMotion

I’ve experienced weird errors and faults with random virtual machines in a 5-host HA/DRS cluster. During a vMotion or after restarting the Management Agents on the ESXi-hosts, a virtual machine becomes unmanageable. Tasks, like vMotion, become stuck at the 82% mark, or vCenter detects an ‘invalid’ state because it no longer knows where the VM is actually running.

Symptoms

Let’s take an example I experienced this morning:

  • A virtual machine ‘VM01’ was running on ESX04.
  • DRS invoked a vMotion of this virtual machine to another host, ESX02.
  • The vMotion task remained stuck at 82% for over two hours.
  • The VM itself still responded to ping and RDP-connections.
  • When viewing the hosts’ inventory (with the vSphere Client connected directly to each of the two hosts), neither had the virtual machine registered and running.
  • Using ‘esxtop’, I discovered that the virtual machine was actually running on ESX02, so the vMotion did complete successfully.
  • hostd.log on the destination host stated:
    ['vm:/vmfs/volumes/4c713f9b-9b799538-a0f9-78e7d1f7b584/VM01/VM01.vmx'] connect
    to /var/run/vmware/root_0/1299491418874549_26415352/testAutomation-fd:
    File not found
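
To see quickly which virtual machines on a host have hit this, the hostd.log message above can be searched for. This is only a sketch: find_unreachable_vms is a made-up name, and it assumes the message sits on a single line in the real log file (it is wrapped in the excerpt above).

```shell
#!/bin/sh
# Hypothetical helper: list the .vmx paths for which hostd logged the
# "testAutomation-fd: File not found" message.
# Usage: find_unreachable_vms HOSTD_LOG
find_unreachable_vms() {
    # Capture the path between ['vm: and '] on matching lines, de-duplicated.
    sed -n "s/.*\['vm:\([^']*\)'].*testAutomation-fd: File not found.*/\1/p" "$1" | sort -u
}

# On the host (not executed here; pass the path to your hostd.log):
#   find_unreachable_vms /var/log/vmware/hostd.log
```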

Resolution

Duco Jaspars gave me a helping hand and pointed me to VMware KB 1033591.

  • The virtual machine’s vmware.log stated:
    vmx| /vmfs/volumes/4c713f9b-9b799538-a0f9-78e7d1f7b584/VM01/VM01.vmx: Setup
    symlink /var/run/vmware/1a79f8eb60648d355323caa9ae6e4cae
    -> var/run/vmware/root_0/1299491418874549_26415352

  • This VM did have an entry, so I skipped to step 4 of the resolution
  • Contents of /var/run/vmware/root_0/*
    root_0/1299491418874549_26415352:
    configFile -> /vmfs/volumes/4c713f9b-9b799538-a0f9-78e7d1f7b584/VM01/VM01.vmx

  • I recreated the symlink with:
    ln -s /var/run/vmware/root_0/1299491418874549_26415352
    /var/run/vmware/1a79f8eb60648d355323caa9ae6e4cae

  • I restarted the Management Agents on the affected host
  • After reconnecting the host in vCenter, the VM was fully operational again (the vMotion task failed the moment I restarted the management agents).
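
The lookup and the ln -s step above could be wrapped in two small shell functions. This is a rough sketch, not the procedure from the KB: the names symlink_from_log and restore_symlink are made up, and it assumes the "Setup symlink" entry sits on one line in vmware.log. Note that the target in my log excerpt is printed without a leading slash, so check both paths before passing them on.

```shell
#!/bin/sh
# Hypothetical helper: print "LINK TARGET" as found in the last
# "Setup symlink" entry of a vmware.log.
symlink_from_log() {
    sed -n 's/.*Setup symlink \([^ ]*\) -> \([^ ]*\).*/\1 \2/p' "$1" | tail -n 1
}

# Hypothetical helper: recreate LINK -> TARGET, but only if TARGET is an
# existing directory and nothing is present at LINK yet.
restore_symlink() {
    link="$1"
    target="$2"
    [ -d "$target" ] || { echo "target missing: $target" >&2; return 1; }
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "link already present: $link" >&2
        return 0
    fi
    ln -s "$target" "$link"
}

# On the host (not executed here), with the paths from the example above:
#   restore_symlink /var/run/vmware/1a79f8eb60648d355323caa9ae6e4cae \
#       /var/run/vmware/root_0/1299491418874549_26415352
```

The management agents still need to be restarted afterwards, as in the step above.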

The weird thing about this situation is that it happens with random virtual machines and only occasionally. Most vMotion tasks complete successfully, while some (let’s say 1 in 20) experience this problem.

Duco told me that he experienced this problem on HP hosts using B-spec SD-cards for the ESXi hypervisor, but VMware Technical Support stated that they’ve seen this with hosts using local (SAS-)disks as well.

Has anyone experienced this as well? Are you using HP ProLiant DL360/380 G7 hosts? I’m very curious whether you found a permanent solution for this.