I’ve been having a lot of latency issues lately with two Dell PowerEdge R310s. These 1U boxes have a low-end controller, a PERC H200. I’ve been seeing latency spikes in the range of 500-600ms, which is high enough to make the Linux VMs continuously remount their filesystems read-only. This happens basically any time one of the VMs does a moderate amount of I/O (say, 25+ IOPS), causing the controller to lock up and take down multiple other VMs along the way. It also happens during any operation on the controller itself, such as formatting a disk with VMFS, creating a snapshot, consolidating a disk or removing a snapshot.
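(For context: the read-only remounts are the guests protecting themselves. Ext3/ext4 filesystems are typically mounted with errors=remount-ro, so as soon as an I/O request against the virtual disk errors out or times out, the guest flips to read-only. A rough sketch of how to confirm this from inside an affected Linux VM; the exact kernel messages vary by version and the device name is just an example:)

    # Inside an affected Linux guest: look for I/O errors and forced read-only remounts
    dmesg | grep -iE "i/o error|remount.*read-only"

    # Check the filesystem's configured error behaviour (device name is an example)
    tune2fs -l /dev/sda1 | grep -i "errors behavior"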

As you can imagine, performance was abysmal, and something needed to be done. I monitored the guests to see whether something inside the guest OS was causing the controller to skid out of control (I even down- and upgraded Linux kernels and changed guest filesystems), and I tried different mpt2sas driver versions, different advanced storage settings and many other things, but in the end nothing really helped.
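For anyone retracing those steps: the installed driver and provider VIBs can be checked from the ESXi shell. A rough sketch, assuming SSH or local shell access to the host:

    # On the ESXi host: list installed VIBs for the SAS driver and the LSI SMI-S provider
    esxcli software vib list | grep -iE "mpt2sas|lsiprovider"

    # Show the loaded mpt2sas module and its current parameters
    esxcli system module get -m mpt2sas
    esxcli system module parameters list -m mpt2sas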

Until I spotted this post on the Dell Community by a user called ‘damirc’:

Well, found my solution.
It seems that the LSI SMI-S provider (the health provider for the vSphere console) is not too comfortable with Dell PERC H200 (or LSI 9211-8i) and seriously slows down disk i/o.

Worth a try, right? I removed the vib (‘lsiprovider’) and rebooted the host. And hey presto, I could easily push the SSD and H200 controller north of 4,000 IOPS with sub-10ms latency without any issue, which is pretty good in my view, and certainly a substantial improvement over the latency spikes and horribly low IOPS from before. After a couple of hours of testing and monitoring, removing the SMI-S provider seems to have made the previously mentioned issues disappear completely.
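For completeness, this is roughly what the removal looked like. It’s a sketch, assuming shell access to the host and the vib name ‘lsiprovider’ as it appeared on my hosts; the fio run at the end is just one way to generate the kind of random I/O that used to trip the controller, not the exact test I ran:

    # On the ESXi host: confirm the provider vib is installed, remove it and reboot
    esxcli software vib list | grep -i lsiprovider
    esxcli software vib remove -n lsiprovider
    reboot

    # Afterwards, inside a Linux guest: a simple 4k random-read test to check IOPS and latency
    fio --name=randread --filename=/tmp/fio-test --size=1g --direct=1 \
        --rw=randread --bs=4k --ioengine=libaio --iodepth=32 \
        --runtime=60 --time_based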

Despite Googling, asking around on Twitter and doing other research, I can’t find any more information on this issue, other than some similar but unconfirmed cases (HP EVA SMI-S Provider Collection Latency Issue and Dell R210 II alternative SATA/SAS RAID controllers?).

I’m curious to hear from others who might have similar issues, or from vendors who might have more information. Is this a confirmed bug? Does it apply to a specific vib version, a specific controller (OEM version, firmware version), or anything else? I’ve started threads on the VMware and Dell communities and submitted an SR with LSI (P00099195).

I will keep monitoring and do some more testing on a spare host that still exhibits the issue to see if I can narrow it down. I’ll post an update here if appropriate.