I’ve been having a lot of latency issues lately with two Dell PowerEdge R310s. These 1U boxes have a low-end controller, a PERC H200. I’ve been seeing latency spikes in the range of 500-600 ms, which is high enough for the Linux VMs to continuously remount their filesystems read-only. This happens basically any time one of the VMs does moderate I/O (say, 25+ IOPS), causing the controller to lock up and take down multiple other VMs along the way. It also happens during any operation on the controller itself, such as formatting a disk with VMFS, creating a snapshot, consolidating a disk or removing a snapshot.
As you can imagine, performance was abysmal, and something needed to be done. I monitored the guests to see if something inside the guest OS was causing the controller to spin out of control (I even downgraded and upgraded Linux kernels and changed guest filesystems), tried different mpt2sas driver versions, different advanced storage settings and many other things, but in the end nothing really helped.
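In case you want to check the same things on your own host: the installed mpt2sas driver can be verified from the ESXi shell with something like the commands below. The exact vib name may differ slightly depending on how the driver is packaged, so treat this as a rough sketch.

```
# Show the mpt2sas driver vib that is currently installed
esxcli software vib list | grep -i mpt2sas

# Show details of the loaded mpt2sas module, including its version
esxcli system module get -m mpt2sas
```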
Until I spotted this post on the Dell Community by a user called ‘damirc’:
Well, found my solution.
It seems that the LSI SMI-S provider (the health provider for the vSphere console) is not too comfortable with Dell PERC H200 (or LSI 9211-8i) and seriously slows down disk i/o.
Worth a try, right? I removed the vib (‘lsiprovider’) and rebooted the host. And hey presto, I could easily push the SSD and H200 controller north of 4,000 IOPS with sub-10 ms latency without any issue, which is pretty good in my view and certainly a substantial improvement over the latency spikes and horribly low IOPS from before. After a couple of hours of testing and monitoring, the previously mentioned issues seem to have disappeared completely after removing the SMI-S provider.
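For reference, removing the provider from the ESXi shell looks roughly like this. The vib was called ‘lsiprovider’ here, but double-check the name in the list output first, as it may differ per provider version.

```
# Find the LSI SMI-S provider among the installed vibs
esxcli software vib list | grep -i lsi

# Remove the provider vib (named 'lsiprovider' on my hosts)
esxcli software vib remove -n lsiprovider

# A reboot is required for the removal to take effect
reboot
```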
Despite Googling, asking around on Twitter and other research, I haven’t been able to find any more information on this issue, other than some similar but unconfirmed cases (HP EVA SMI-S Provider Collection Latency Issue and Dell R210 II alternative SATA/SAS RAID controllers?).
I’m curious to hear from others that might have similar issues or from vendors that might have more information. Is it a confirmed bug? Does this apply to a specific vib version, a specific controller (OEM version, firmware version), or anything else? I’ve started threads on the VMware and Dell communities and submitted an SR with LSI (P00099195).
I will keep monitoring the issue and doing some more testing on a spare host that still has the issue to see if I can narrow it down. I’ll post an update here if appropriate.
Hi Joep, thanks for sharing this piece of information with the community. Does this problem apply to all LSI HBAs and RAID controllers?
I have a PERC H700 RAID controller with a BBU, which is based on the LSI 9260. Using this controller in combination with ESXi 5.5, I get horrible speeds.
If you have some spare time, I’m willing to lend you this controller. I bought it cheaply for my homelab, but it doesn’t perform well under ESXi 5.5.
Hi Sergio,
I’m assuming that this issue applies to all Dell PERC-branded, LSI-based controllers in combination with the LSI SMI-S provider. You can easily test this yourself by removing the LSI SMI-S vib from the host and rebooting. I ran esxtop for a couple of hours before and after and charted the DAVG/cmd values, which showed a significant improvement.
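If you want to run a similar before/after comparison, esxtop’s batch mode makes it easy to capture the values for charting. The interval, sample count and datastore path below are just examples, not the exact settings I used.

```
# Capture all counters, one sample every 10 seconds, 360 samples (~1 hour)
esxtop -a -b -d 10 -n 360 > /vmfs/volumes/datastore1/esxtop-before.csv

# Remove the SMI-S provider vib, reboot, then capture a second run
esxtop -a -b -d 10 -n 360 > /vmfs/volumes/datastore1/esxtop-after.csv

# Chart the DAVG/cmd columns for the relevant adapter (vmhba) from both CSVs;
# Windows perfmon or a spreadsheet works fine for this
```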
Have you guys had a chance to run 500.04.V0.56-0005? See if it helps.