HP LeftHand Multi-Site SAN & VMware vSphere
Aug 25, 2010 Blogs
I’d like to warn you about a not-so-obviously documented quirk in the combination of an HP LeftHand (P4000) Multi-Site Cluster and VMware vSphere hosts on multiple sites.
The situation
HP LeftHand (P4000) Multi-Site Cluster
HP LeftHand Multi-Site Cluster requires the storage nodes to be divided into two subnets, one per site. Each subnet (or VLAN) requires a Virtual IP in that subnet.
Example
E.g.: VLAN 1 with Virtual IP 10.10.1.100 (/24) for site 1, which holds storage node 1 and storage node 3, and VLAN 2 with Virtual IP 10.10.2.100 (/24) for site 2, which holds storage node 2 and storage node 4.
LUN Ownership
Whenever a LUN (or volume, in LeftHand’s CMC) on the LeftHand cluster is accessed by an iSCSI Initiator, that volume is bound to the Virtual IP (and thus subnet / VLAN) by which the LUN is accessed.
Example
E.g.: when an ESX-host on Site 1 rescans its vmhba38, the volumes presented to that ESX-host get bound to this site’s Virtual IP, which is in VLAN 1.
VMware ESX software iSCSI Initiator and multiple subnets
The VMware software iSCSI Initiator (and, by extension, the dependent hardware iSCSI Initiator) does not support accessing an iSCSI Target (or LeftHand Virtual IP) outside its own subnet.
Example
E.g.: the ESX-hosts on Site 2, which have their iSCSI adapter in VLAN 2, cannot traverse the network to access the LUN that is bound to VLAN 1. They do a discovery against the Virtual IP in VLAN 2, but get redirected to the Virtual IP in VLAN 1, because that’s where the LUNs are bound. You’ll see errors in /var/log/messages about iscsid not being able to connect to the Virtual IP in VLAN 1.
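To make the subnet restriction concrete, here is a minimal Python sketch that checks which Virtual IPs share a subnet with a host's VMkernel ports. The two VIPs come from the example above; the host addresses and the function name are made up for illustration, and the check only models "same subnet or not", nothing about the actual initiator.

```python
import ipaddress

# VIPs from the example above (one per site / VLAN).
VIPS = {
    "site1": ipaddress.ip_interface("10.10.1.100/24"),
    "site2": ipaddress.ip_interface("10.10.2.100/24"),
}

def reachable_vips(vmkernel_ips):
    """Return the sites whose VIP shares a subnet with at least one VMkernel port."""
    ports = [ipaddress.ip_interface(ip) for ip in vmkernel_ips]
    return sorted(
        site
        for site, vip in VIPS.items()
        if any(port.network == vip.network for port in ports)
    )

# Hypothetical host on site 2 with VMkernel ports only in VLAN 2.
print(reachable_vips(["10.10.2.11/24", "10.10.2.12/24"]))  # ['site2']
```

A host like this can never follow the redirect to the VIP in VLAN 1, which is the failure described above.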
The solution
Multiple VMkernel ports for iSCSI in multiple VLANs
The only way to maintain synchronous replication (‘Network RAID 10’ in LeftHand naming convention) is to maintain the HP LeftHand Multi-Site Cluster. This means maintaining the multiple VLANs for iSCSI. To enable hosts on site 2 to access the LUNs, you’ll need to configure additional VMkernel ports for iSCSI in the other VLAN. So in addition to the two VMkernel ports in VLAN 2 on an ESX-host in site 2, you’ll add two VMkernel ports in VLAN 1 on that ESX-host. On the ESX-hosts in site 1, you’ll add two VMkernel ports in VLAN 2.
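The resulting port layout can be written down as a small plan. This is only a sketch: the VLAN numbers and the two-ports-per-VLAN count follow the text, while the host name and the function are hypothetical.

```python
# VLAN ids per site, as in the example; two VMkernel ports per VLAN as described.
ISCSI_VLANS = {"site1": 1, "site2": 2}
PORTS_PER_VLAN = 2

def vmkernel_plan(host, home_site):
    """List (host, vlan, role) tuples: the local VLAN ports first, then the remote ones."""
    plan = []
    # Sort so that the host's own site comes first (False sorts before True).
    for site, vlan in sorted(ISCSI_VLANS.items(), key=lambda kv: kv[0] != home_site):
        role = "local" if site == home_site else "remote (crosses intersite link)"
        plan.extend((host, vlan, role) for _ in range(PORTS_PER_VLAN))
    return plan

# Hypothetical ESX-host on site 2: two ports in VLAN 2, plus two added in VLAN 1.
for entry in vmkernel_plan("esx-b1", "site2"):
    print(entry)
```

Every host ends up with four iSCSI VMkernel ports, two of which talk to the other site.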
The impact of this change is additional iSCSI traffic over the intersite link: not only does the replication traffic travel between sites, now also iSCSI traffic from the ESX-hosts in site 2 travels the intersite link to the storage nodes in site 1.
Another change in this environment is the added complexity to complete a site failover. You’ll need to rescan all software iSCSI adapters after the primary site has failed, because the LUNs are still attached to the (now failed) Virtual IP of the primary VLAN. By rescanning, the LUNs are attached to the Virtual IP of the other VLAN. If site 2 fails, only the replication is lost, as site 2 doesn’t do any iSCSI traffic with any of the ESX-hosts but merely does Network RAID 10 replication.
If you dare create a single vSphere Cluster, spanning HA/DRS across sites, VMware HA can take care of VM failover. If hosts on site 1 fail but storage is still alive, a rescan on the hosts on site 2 is needed before HA can restart the VMs. You’d better be quick with that rescan! If hosts on site 2 fail, no rescan or other action is needed, as HA will restart VMs on the ESX-hosts in site 1 (because ESX-hosts in site 1 have an active session with the LUNs).


October 10th, 2010 at 11:03
I wonder if this is a correct “best practise” setup:
In this document (http://h20195.www2.hp.com/V2/GetDocument.aspx?docname=4AA3-0261ENW) they’re stating clearly:
“It is important to note that native vSphere 4 multi-pathing cannot be used with HP P4000 Multi-Site
SAN configurations that utilize more than one subnet and VIP (virtual IP). Multiple paths cannot be
routed across those subnets by the ESX/ESXi 4 initiator.”
and here (http://h20195.www2.hp.com/V2/getdocument.aspx?docname=4AA0-4385ENW&cc=us&lc=en) they’re saying:
“It is preferred to have the iSCSI traffic, VMware FT logging and VM traffic on
its own subnet to help address any networking challenges especially when working with layer 3
switching. HP best practices recommend implementing rapid spanning tree and deploying layer 2 switching.”
So it seems the one-subnet setup is preferred over a multi-subnet one?
On the other hand, I was under the impression that the next version of SAN/iQ would be location aware, meaning that the storage initiator (ESX) communicates with the closest storage node (independent of the Virtual IP to which the LUN is bound). But I can’t see how location awareness can be achieved without a multi-subnet config.
Anyway, it seems networking is the big challenge when designing a multi-site P4000/vSphere setup.
October 11th, 2010 at 10:37
For a vSphere setup, I’d prefer a single subnet (and thus a single Virtual IP). However, a Multi-Site LeftHand cluster simply requires two subnets.
I sure hope VMware fixes the problem with the iSCSI Initiator not being able to connect to a target on a different subnet than the VMkernel, in addition to HP/LeftHand making SAN/iQ location-aware.
October 11th, 2010 at 11:01
I’ll run this past HP as they’re in the process of speccing up a full P4000 cluster with servers to do a multi-site.
One of their own WebEx sessions done with VMware says that single subnet is best practise, then their documentation says to use multiple subnets.
With Joep’s solution it seems you can use multi-pathing, but how reliable is the failover within vSphere if you have it set to use the VIPs in both sites for discovery?
October 12th, 2010 at 9:57
Failover should work as if you only have a single SAN, a single vSphere cluster with HA. This means: no manual tasks to be performed whenever a site (server) goes down, and only a couple of minutes of downtime for the VMs.
October 12th, 2010 at 19:58
1) Isn’t that contradictory?
You’ve stated in the article that a “rescan” is needed! That is a manual task isn’t it?
2) Just to optimize things by lowering the traffic over the intersite link… wouldn’t it be possible to have 2 LUNs?
One exclusively containing VMs running on ESX 1 (site 1), and another one containing VMs running on ESX 2 (site 2)?
That way, you can make VIP 1 owner of LUN 1 and VIP 2 owner of LUN 2.
There is still replication traffic, but both ESX hosts connect to their closest VIP.
HA would still work, DRS would be a “bad” idea.
October 13th, 2010 at 13:59
Vincent,
the ‘rescan’ of storage adapters can be automated using a scheduled task in vCenter.
Using multiple LUNs and running all VMs on that LUN in a single site seems like a good idea, and you can use DRS, by using Host Affinity rules, with which you can create some artificial ‘boundaries’ to define a ‘site’.
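To illustrate why the affinity boundaries matter, here is a small Python sketch that flags VMs whose host site differs from the site that owns their LUN. The inventory is entirely hypothetical; in a real setup this data would have to come from vCenter and the CMC.

```python
# Hypothetical inventory: which site's VIP owns each LUN, and where each VM runs.
LUN_SITE = {"LUN1": "site1", "LUN2": "site2"}
VMS = [
    {"name": "vm-a", "lun": "LUN1", "running_on": "site1"},
    {"name": "vm-b", "lun": "LUN2", "running_on": "site2"},
    {"name": "vm-c", "lun": "LUN1", "running_on": "site2"},  # e.g. moved by DRS
]

def cross_site_vms(vms, lun_site):
    """Return names of VMs whose iSCSI traffic has to cross the intersite link."""
    return [vm["name"] for vm in vms if lun_site[vm["lun"]] != vm["running_on"]]

print(cross_site_vms(VMS, LUN_SITE))  # ['vm-c']
```

Host Affinity rules are what would keep a VM like vm-c from ending up on the wrong side in the first place.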
October 13th, 2010 at 18:18
I think from a chat with Joep and piecing together lots of other sources, for a two site, this is the ideal setup:
Site A
Server A (iSCSI NICs in VLAN1 and VLAN2)
Switch A
Multi-Site Cluster 1 VIP 192.168.0.1 (VLAN1)
|
|10gbps link with VLAN Tagging for VLAN1 and VLAN2
|
Site B
Server B (iSCSI NICs in VLAN1 and VLAN2)
Switch B
Multi-Site Cluster 2 VIP 192.168.1.1 (VLAN2)
I have to admit I am still a little confused by how much of the vSphere HA and rescan stuff is automatic and how much is manual – I thought the whole point of specifying multiple paths was that vSphere could work out for itself to use any available path to the same LUN?
October 14th, 2010 at 12:08
It doesn’t seem a good idea to schedule a rescan… what timing would be a good practise?
Too many rescans will impact the performance, too few rescans and the time-out will be too long.
For example, 1 minute:
I don’t think every application can survive that long without I/O.
October 14th, 2010 at 16:21
Scheduling a rescan isn’t a fantastic idea, agreed.
October 14th, 2010 at 17:23
If you have two sites with direct 10gbps fibre between switches (so no routing needed) is there a reason you MUST use two subnets and you could not just use one subnet for all iSCSI in both sites?
I thought the whole idea of multipathing is that you don’t have to rescan if a path fails, it detects it and does it by itself – the talk of scheduling rescans is confusing.
October 15th, 2010 at 19:14
The two subnets are simply required for the Multi-Site setup.
If you don’t use the multi-site setup but use a single cluster instead, there’s no guarantee your data is redundant between the sites…
October 16th, 2010 at 15:31
Not sure if we’re using different terminology for the same thing, but from speaking with a LeftHand technical consultant yesterday, they said you can configure sites in P4000 whilst still using a single VIP.
So in our situation we should be able to use two switches with a single 10gb link between (no routing) and a single subnet/VIP for the P4000 cluster, but still have Site A and Site B defined and the P4000 will ensure the data is striped correctly.
October 16th, 2010 at 17:11
@Joep: I don’t agree with you when you say two subnets are required for a Multi-Site setup. I have a number of customers who have a dedicated fibre to span one subnet over two sites.
@Paul: You might want to read this: http://frankdenneman.nl/2009/10/lefthand-san-lessons-learned/
To ensure your data is REALLY redundant in a 2-way 2-site cluster, you must take care of the order of those boxes.
(and there is also a good explanation of VIPs and gateways (VIPLB) in that post)
I’m more and more convinced the one-subnet solution is the best one (at least as long as VMware doesn’t support routing for its iSCSI traffic).
The downside is that you might get a lot of traffic over the interlink, but you might be able to circumvent this by creating multiple VMFS LUNs based on the site where they will be used most.
Another big downside is that your interlink becomes critical. In case of interlink failure, one site will go down completely (which one depends on the placement of the FOM).
So you might want to think about extra redundancy on the interlink.
October 16th, 2010 at 19:58
@Vincent – thanks for that, an interesting read that highlights the main thing with single subnet multi-site: get the nodes in the right sites!!!
We’ll go with a 10gbps link, our physical site layout means any amount of redundant links all pass through the same conduit sadly, we’re stuck with that and we (as IT) can’t change it.
So the current thing I’m trying to figure out is what if the iSCSI link fails, but the host(s) and primary LAN are still up?
HA doesn’t seem to cater for the storage disappearing from a host?
October 17th, 2010 at 12:55
Vincent, Paul: I think you both might have found something I completely overlooked: even in a non-multi-site cluster (i.e. a normal cluster), you can define multiple sites, thus circumventing the need for two subnets/VIPs, while still maintaining data redundancy. I will test this out in a lab environment and post the results. Thanks!
October 17th, 2010 at 16:56
Joep, the HP documentation could be much better. Of course you may be in a situation where your network is routed, with no VLANs etc., so you have no choice but to use two subnets. But yeah, as you say, you can define sites within a single subnet, so you can have a “stretched” cluster but maintain redundancy.
Incidentally one new thing I didn’t know (and this isn’t unique to Lefthand) is apparently MPIO from within a guest is not supported by VMware – I certainly didn’t know that.
October 17th, 2010 at 21:23
Neither did I. That is certainly something to keep in mind when designing an environment where bigger application servers (SQL, Exchange, SharePoint, etc.) make a direct connection to the SAN LUNs…
October 18th, 2010 at 13:29
Be interested to know if you have any contacts in vmware or other vmware bloggers who can shed a little more light – to me guest MPIO seems an obvious thing to want to do and I’m sure lots of people use it.
October 18th, 2010 at 21:08
One other thing on the Multi-Site or Normal Cluster config:
If you create a cluster (selecting ‘Multi-Site Cluster’), you’ll be forced to use 2 Virtual IP’s and thus 2 subnets/VLANs. If you create a cluster (selecting ‘Normal Cluster’) and creating two sites manually, you aren’t required to use two Virtual IP’s (and thus 2 subnets/VLANs). Both enable you to use ‘Network RAID-10 (2-way mirror)’ as a volume type. With only a single Virtual IP, storage networking (i.e. iSCSI on the VMware hosts) is simplified.
October 19th, 2010 at 0:13
@Paul:
Q:
So the current thing I’m trying to figure out is what if the iSCSI link fails, but the host(s) and primary LAN are still up?
HA doesn’t seem to cater for the storage disappearing from a host?
Possible A:
Thanks to the FOM, one LeftHand storage box will remain online, and the other will automatically freeze (stop I/O).
The site of the online box will continue working. The ESX host on the other site will lose storage, so normally HA should kick in and start those machines on your “live” site.
Of course, if the complete link is down (including data traffic), this means that nobody on the “offline” site will see the server anymore, because it’s running on the “live” site.
October 19th, 2010 at 18:47
Thanks Vincent, that’s what I didn’t know and what is so hard to test.
I thought that HA simply said “Host is up, that’ll do for me”, but if it says “Host is up, but I can’t mount the VMs, you do it” that sounds like what I’m looking for (it’s a bit late to test once you’ve purchased the kit!).
Thanks!
October 20th, 2010 at 10:58
@Paul:
You might have a point. I was under the impression that a host losing its storage would go into isolation mode, and HA would kick in.
I checked some documentation and I may be wrong, as “isolation” is only documented as a host losing its network connectivity.
So actually: I don’t know.
It seems logical and doable that HA would kick in, but I’m not sure.
This is something that should be tested. I won’t be able to test this in the near future. Maybe Joep can help you with that?
October 24th, 2010 at 6:21
In order to accomplish the iSCSI rescan, why not set an action on the host disconnect alarm to run a script that does the rescan on the remaining hosts?
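A rough Python sketch of what such a script could look like. It assumes the script is told which host failed, that the surviving ESX hosts accept key-based SSH as root, and that the software iSCSI adapter is vmhba38 as in the article; the host names are invented. It uses the esxcfg-rescan command from the ESX service console and defaults to a dry run that only prints the commands.

```python
import subprocess

# Hypothetical inventory: host name -> software iSCSI adapter on that host.
ISCSI_ADAPTERS = {
    "esx-a1": "vmhba38",
    "esx-b1": "vmhba38",
}

def rescan_commands(failed_host, adapters=ISCSI_ADAPTERS):
    """Build one ssh command per surviving host; the failed host is skipped."""
    return [
        ["ssh", f"root@{host}", "esxcfg-rescan", hba]
        for host, hba in sorted(adapters.items())
        if host != failed_host
    ]

def rescan_survivors(failed_host, dry_run=True):
    """Print the rescan commands, or run them when dry_run is False."""
    for cmd in rescan_commands(failed_host):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=False: one unreachable host should not stop the other rescans.
            subprocess.run(cmd, check=False, timeout=120)

rescan_survivors("esx-a1")  # dry run: only prints the command for esx-b1
```

How the alarm hands the failed host's name to the script is left open here, and this has not been tested against a real vCenter alarm.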
October 25th, 2010 at 13:32
Robert: that’s not a bad idea, I will test this as soon as I can. Thanks!
October 27th, 2010 at 15:41
@Paul:
Apparently, storage failure is not an option:
http://communities.vmware.com/message/1630350
October 27th, 2010 at 19:25
Thanks Vincent – that thread was me, I get everywhere when I can’t find something documented
November 8th, 2010 at 4:41
With iSCSI having more overhead, and VMware also supporting NFS, would configuring LeftHand storage with VMware ESXi via NFS be a better choice? We might not even have the problem with the multiple subnets.
Correct me if I am wrong.
November 8th, 2010 at 11:11
Hi Yih. As far as I know, LeftHand storage does not serve NFS mounts. It is an iSCSI-only box.
November 8th, 2010 at 15:52
It’s a small world
November 10th, 2010 at 6:51
Hi Joep
http://h10010.www1.hp.com/wwpc/us/en/sm/WF05a/12169-304616-3930449-3930449-3930449-4118659.html
Looking at the above link it does support NFS protocol.
Boosts the value of P4000 SAN Solutions by adding Windows-powered IP-based gateway services. Fully featured HP StorageWorks P4000 Unified NAS Gateway includes support for multiple file protocols (CIFS, NFS, etc.), deduplication, print services and high availability through Microsoft Clustering.
November 11th, 2010 at 11:35
Nah, not really. The NAS gateway is a Windows Server serving NFS and CIFS. The disks at the back-end are LUNs on the P4000 iSCSI SAN. Thus, the P4000 doesn’t actually do CIFS or NFS. You need additional components that translate SCSI-based disks to CIFS/NFS…
December 2nd, 2010 at 5:32
Another solution for Multisite access is to have routing enabled between iSCSI subnets and add static routes on each host and storage to default gateways on iSCSI subnets.
March 1st, 2011 at 17:55
So I’ve been running through this with HP recently. They’re actually recommending collapsing my multi-site setup (two subnets) back down to one just by editing the IP address and gateway definitions on one set of the storage node and removing the second VIP. They’re saying the VIP failover should be quick enough, and that they’re starting to recommend this to customers who are stretching VMware clusters across sites.
One other possibility that’s been suggested (besides the unsupported iscsi gateway routing trick – I’d be concerned that would break in a future version of vSphere, since v5 is supposed to remove the service console OS and we don’t know that they’ll add support for multiple vmkernel gateways) is to actually use adapter failover. Set each vmk iscsi portgroup to use the other adapter as “standby”, and if a link goes down we’d be OK. A single storage switch failure is probably more likely than a site failure in our case. We don’t need vlan tagging because we have two subnets in a single vlan.
Not sure how I feel about that adapter failover option. Part of it is that we’re going to be using a pair of Novell SLES 10 SP3 / OES2 file servers as a cluster, and whatever I end up doing for the VMware hosts needs to work with the Linux iSCSI initiator as well.
March 3rd, 2011 at 21:27
Hi Paul,
Bringing it down from a MultiSite with two subnets and VIPs to a single subnet/VIP is exactly the same advice I got from HP.
Adding a second vmnic to a VMkernel Port Group for iSCSI seems a bit iffy, I wouldn’t go down that route wholeheartedly..