I am digging deeper into HP P4000 (LeftHand) SANs again. This has been a consistent side project of mine for over two years now. I started using the VSA as a simple, fast and reliable appliance for a shared storage and SAN workshop I've delivered at my employer's training center, and did a real-life implementation of a production environment a while back.

I noticed that when designing a P4000 solution that combines factors like Multi-Site storage, stretched HA/DRS clustering and MPIO with vSphere, you need to be extra careful and really kick the tyres of the set-up in a lab to get to know all the gory details. Today, I tested a couple of those factors in the design.

Routed iSCSI

The first pitfall is an obvious one: VMware's software iSCSI initiator cannot do routed iSCSI. The ESXi 5 initiator cannot route multiple paths across different subnets. This effectively means that you can only do a P4000 Multi-Site SAN with a single Virtual IP (VIP).

With VMware ESX, the preferred configuration is a single subnet and with ESX servers assigned to sites in SAN/iQ. (source)

This actually has an advantage or two, too. Failover between sites is transparent; the ESXi hosts do not even notice a site failure. Assuming your hosts didn't go down as well during the failure, obviously.
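For reference, the single-subnet configuration looks roughly like this from the ESXi side. This is a minimal sketch that drives esxcli from a Python script; the adapter name, vmkernel ports and VIP address are placeholders for your environment, and the vSphere client or plain esxcli will obviously do the same job:

```python
# Minimal sketch: bind two vmkernel ports on the same subnet to the software
# iSCSI adapter and point discovery at the single cluster VIP.
# Adapter name, vmk interfaces and VIP address are placeholders.
import subprocess

ADAPTER = "vmhba33"          # software iSCSI adapter (placeholder)
PORTS = ["vmk1", "vmk2"]     # vmkernel ports, both on the iSCSI subnet
VIP = "10.10.10.100:3260"    # single P4000 cluster Virtual IP (placeholder)

def esxcli(*args):
    # Runs esxcli in the ESXi shell; raises if the command fails.
    subprocess.check_call(["esxcli"] + list(args))

for vmk in PORTS:
    # Port binding: both paths stay within one subnet, no routing involved.
    esxcli("iscsi", "networkportal", "add", "--adapter", ADAPTER, "--nic", vmk)

# One send-target discovery address: the cluster VIP.
esxcli("iscsi", "adapter", "discovery", "sendtarget", "add",
       "--adapter", ADAPTER, "--address", VIP)

# Rescan to pick up the volumes.
esxcli("storage", "core", "adapter", "rescan", "--adapter", ADAPTER)
```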

Multipath I/O with vSphere

One of the things that caught my eye is a quirk with round-robin MPIO in a vSphere environment.

A P4000 stripes the data in a volume across the storage nodes in a site in a RAID 0 fashion: each storage node hosts an equal part of the volume's data. I tested this today by creating a simple three-node cluster, creating a volume and writing data to that volume. Before the test, each of the three nodes had 1.37 GB provisioned. After I had written some data, all three nodes had 1.65 GB provisioned. This shows that data is written in equally sized chunks to each of the nodes.
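A quick back-of-the-envelope check of those numbers (a toy calculation, not SAN/iQ's actual allocation logic, and thin-provisioning granularity makes it approximate):

```python
# Back-of-the-envelope check of the striping test above.
nodes = 3
provisioned_before_gb = 1.37   # per node, before the test
provisioned_after_gb = 1.65    # per node, after writing data

growth_per_node = provisioned_after_gb - provisioned_before_gb
total_growth = growth_per_node * nodes

print("Growth per node: %.2f GB" % growth_per_node)        # ~0.28 GB
print("Total provisioned growth: %.2f GB" % total_growth)  # ~0.84 GB
# Every node grew by the same amount, consistent with the volume being
# striped across all nodes in equally sized chunks (RAID 0 fashion).
```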

When using MPIO in a vSphere environment, each volume has only one Gateway Connection.

But when using the P4000 DSM for MPIO on a Windows machine, each volume has a Gateway Connection to each of the storage nodes (and an additional ‘control’ session). If the host is assigned to a ‘site’ in the CMC, it’ll only connect to storage nodes in the local site.

This difference in how the connections between a host and the storage nodes are made for each volume dictates how I/O for a given volume flows to the SAN:

With vSphere, all I/O goes through the node hosting the Gateway Connection. 's1v3' (the third VSA in the first site) hosts the Gateway Connection and sends back-end I/O to the other (local) nodes to fetch or write data. This makes a single storage node the bottleneck for I/O to a specific volume, and makes balancing which storage node hosts the Gateway Connection for a given volume very important: you'd want every storage node to host an equal number of Gateway Connections. I would recommend creating one or more volumes for each storage node. In a six-node storage cluster, create six (or twelve) volumes.

When using VIP and load balancing, one iSCSI session acts as the gateway session. All I/O goes through this iSCSI session. (source)
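To make that "six (or twelve) volumes" advice a bit more concrete: the end state you're after is a round-robin-like spread of Gateway Connections over the nodes. A toy sketch of that target state (SAN/iQ does the actual placement, you can only steer it by the number of volumes; node and volume names are made up):

```python
# Hypothetical illustration of an even Gateway Connection spread: with the
# number of volumes a multiple of the node count, a round-robin placement
# gives every node the same number of Gateway Connections. SAN/iQ decides
# the real placement; this only shows the desired end state.
from itertools import cycle

nodes = ["s1v1", "s1v2", "s1v3", "s2v1", "s2v2", "s2v3"]   # six-node cluster
volumes = ["vol%02d" % i for i in range(1, 13)]            # twelve volumes

gateway = dict(zip(volumes, cycle(nodes)))                 # volume -> gateway node

per_node = {n: sum(1 for g in gateway.values() if g == n) for n in nodes}
print(per_node)   # every node ends up hosting two Gateway Connections
```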

With Windows and the P4000 DSM for MPIO, on the other hand, I/O flows from the host to each storage node directly. In this case, there's no problem with a single storage node hosting a Gateway Connection like there is with vSphere.

In the graphs, you can clearly see a different I/O pattern for a vSphere (ESXi 5) host and a Windows machine using the HP P4000 DSM for MPIO. There is a downside to using the DSM: the number of iSCSI sessions from a host is much higher, possibly running into the maximum number of iSCSI sessions supported by the OS, which in turn limits the number of servers or volumes.
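To put some rough numbers on that, based on the connection patterns described above (the path, node and volume counts are assumptions for the sake of the example):

```python
# Rough session-count estimate based on the connection patterns above
# (numbers are illustrative; check your OS and SAN/iQ limits).
nodes_per_site = 3      # storage nodes the host is allowed to reach
volumes = 12
paths_per_volume = 2    # bound vmkernel ports on the ESXi host (assumption)

# vSphere software iSCSI: one session per path per volume, but all I/O
# still flows through the single gateway session.
vsphere_sessions = volumes * paths_per_volume

# Windows + P4000 DSM: one session per storage node per volume, plus one
# 'control' session per volume.
dsm_sessions = volumes * (nodes_per_site + 1)

print("vSphere: %d sessions, DSM: %d sessions" % (vsphere_sessions, dsm_sessions))
# 24 vs 48 here; with more nodes and volumes the DSM number climbs towards
# the per-host session limits much faster.
```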

Note: The difference in speed is irrelevant, and probably caused by the fact that I'm running six VSAs and a Windows VM on a single laptop. I wanted to show how ESXi only utilizes a single path while Windows uses all paths; the goal of these measurements was not to compare speed (MB/s).

Which node will host the Gateway Connection for a volume?

For a Windows server with the P4000 DSM for MPIO, this question is irrelevant: it simply doesn't matter which storage node hosts the Gateway Connection for a volume. But for an ESXi host, it does.
This brings me to one of the worst bits of the HP P4000 SAN (and especially its documentation!): with VMware vSphere, there's no real way of influencing SAN/iQ's decision on which storage node will host the Gateway Connection for a volume. In fact, creating 'sites' in the Centralized Management Console (CMC) and placing both ESXi hosts and storage nodes in a specific site will limit SAN/iQ's choice of storage node to host the Gateway Connection: only storage nodes local to the ESXi host that first connects to the volume can host the Gateway Connection.

This is a very bad thing. Using two scenarios, I'll explain why:

Site failure

Imagine all the Gateway Connection roles balanced evenly across all the storage nodes. Your environment is functioning perfectly. Suddenly, the primary site fails. The Gateway Connections for all volumes that were bound to a storage node in this site are transparently failed over to the secondary site. The VMs that went offline during the power outage (since the ESXi hosts went down as well) are restarted by VMware HA. Everything works as expected, and your design has saved the day.
That is, until you restore the primary site. Since there's no way for the administrator to manually fail back the Gateway Connections to the primary site and restore the evenly balanced pre-failure situation, the volumes are stuck on the secondary site. In my experience, there is simply no way to recover from this, other than messing about with reboots of the storage nodes hosting the Gateway Connections. Obviously, that's not ideal. Only when the storage node hosting the Gateway Connection goes offline is the Gateway Connection moved.

There’s no other manual way a Gateway Connection can be moved, as far as I know. The ‘RebalanceVIP’ doesn’t rebalance Gateway Connections, although a lot of people claim rebalancing should happen automatically (in SAN/iQ 9 and up, anyway), I haven’t seen this work for a vSphere environment. Stopping managers on the storage nodes might actually do the trick, but I have not seen consistent failover of Gateway Connections. Lastly, you could run more managers (and designate your primary site as ‘primary’). This might make SAN/iQ prefer to host a Gateway Connection on a storage node on the primary site. If you have more information on how designating a site as primary will affect Gateway Connections, please reach out to me. Leave a comment below or ping me on Twitter.

First boot

The other scenario is just as bad. Imagine you are booting your environment for the first time, or you are recovering from a total shutdown (all storage nodes went offline). The first ESXi host to connect to a volume will bind the Gateway Connection to a storage node in its local site. And since an ESXi host usually accesses all datastores, all volumes are bound to a storage node in that site, leaving no volumes for the ESXi hosts on the other site to connect to locally. This also limits the performance capacity of the SAN: only the storage nodes in a single site will handle active I/O. The storage nodes in the other site will only handle the Network RAID-10 I/O replicated to them.

To avoid this problem you can do a little magic trick. Before you boot up the ESXi hosts, remove them from both sites in the P4000 CMC. This will ensure an even spread of Gateway Connections between the storage nodes on both sites: even though the first ESXi host to boot connects to all volumes, SAN/iQ will balance the Gateway Connections evenly across all storage nodes, regardless of site affinity. Afterwards you can restore site affinity for the ESXi hosts, but make sure to double-check the Gateway Connections first.

Site-affinity for Virtual Machines

So, we’ve now established that only a single storage node will handle I/O for a given volume. That means that all virtual machines’ I/O to that volume is handled by that storage node. If you’re running a Multi-Site SAN and a stretched vSphere HA/DRS cluster, you might run into trouble: an ESXi-host on site 1 can use a storage node on site 2 for virtual machine I/O. And that I/O gets replicated (via the Network RAID-10 synchronous replication in a Multi-Site SAN) back to a storage node in site 1. That means double the I/O is flowing over the intersite link, and potentially leading to high latency.

The solution is rather simple and just a tiny bit elegant:

DRS Groups and Rules

By grouping all the virtual machines on a single volume into a DRS Virtual Machine Group, grouping the ESXi hosts of a site into a Host Group and creating a VM-to-host DRS 'should' rule, you can keep virtual machines running locally. There's one big catch though: since we cannot control which storage node hosts the Gateway Connection for a volume, we have to manually find out which storage node (and thus which site) is hosting the Gateway Connection for that volume. We cannot guarantee that this mapping of a group of virtual machines to a site will match real life. I'm sure @PowerCLIMan can create a script that runs as a scheduled task inside vCenter to correct for this, though.
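I haven't written that script, but as a rough sketch of the DRS plumbing it would drive, this is what creating the VM group, host group and 'should' rule could look like with pyVmomi (all names, credentials and the VM/host selection are placeholders; finding out which site actually hosts the Gateway Connection remains a manual step):

```python
# Rough pyVmomi sketch: put the VMs that live on one volume in a VM group,
# the ESXi hosts of the site hosting that volume's Gateway Connection in a
# host group, and tie them together with a 'should run' rule.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.local",
                  user="administrator@vsphere.local", pwd="secret",
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "StretchedCluster")
view.Destroy()

# Placeholder selections: VMs on datastore 'vol01' (root resource pool only)
# and the hosts of site 1, picked by naming convention.
vol01_vms = [vm for vm in cluster.resourcePool.vm
             if any(ds.name == "vol01" for ds in vm.datastore)]
site1_hosts = [h for h in cluster.host if h.name.startswith("esx-site1")]

spec = vim.cluster.ConfigSpecEx(
    groupSpec=[
        vim.cluster.GroupSpec(operation="add",
            info=vim.cluster.VmGroup(name="vms-vol01", vm=vol01_vms)),
        vim.cluster.GroupSpec(operation="add",
            info=vim.cluster.HostGroup(name="hosts-site1", host=site1_hosts)),
    ],
    rulesSpec=[
        vim.cluster.RuleSpec(operation="add",
            info=vim.cluster.VmHostRuleInfo(
                name="vms-vol01-should-run-site1",
                enabled=True,
                mandatory=False,              # 'should', not 'must'
                vmGroupName="vms-vol01",
                affineHostGroupName="hosts-site1")),
    ])
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
Disconnect(si)
```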

How to prevent all this Gateway Connection stuff

Simple: HP needs to create a VMware PSA multipathing plugin (MPP) for the P4000. As the DSM for MPIO on Windows shows, the P4000 works fine with a vendor-specific multipathing plugin from HP; we just need the same thing for VMware.

On the other hand, you could do the following:

Leave the storage nodes assigned to sites; this removes the 'random' Network RAID-10 issue where a volume might be synchronously replicated within a single site, leaving the other site without a replica. Instead of placing the ESXi hosts in sites, just leave them unassigned. This way, SAN/iQ does not know where the ESXi hosts reside, and will happily allow all hosts to connect to all storage nodes. You're still limited to one storage node handling all active I/O for a volume, and you'll increase the amount of I/O travelling between sites, because you will now almost surely have ESXi hosts on site 1 randomly connecting to storage nodes on site 2. This could increase latency, too, because an I/O from a host in site 1 travels to a storage node in site 2, which then replicates it back to a peer storage node in site 1.