Together with some friends, I’m running the KLAUWD.com (imagine the Governator saying ‘cloud’) IaaS hosting service. It’s a two-server vSphere deployment, where I host some thirty virtual machines. Since I don’t have any shared storage in this environment (and don’t want any additional hardware, to keep costs down), I’ve been looking at numerous virtualized shared storage solutions, such as:

  • HP LeftHand P4000 VSA
  • VMware vSphere Storage Appliance
  • Nexenta NexentaStor

I had the P4000 VSA running for a couple of years, but it was time for a change-up. It’s not that I disliked the P4000; it’s just that I wanted a bit more performance. The old P4000 setup maxed out at about 80 IOPS with two RAID-10 sets consisting of a total of four SATA drives. In its defense, it did do synchronous replication.

After attending the OpenStorage Summit 2012 in Amsterdam a while back, I decided to take NexentaStor for a spin. Their ‘SSD-as-cache’ (ZIL or L2ARC) was what got me interested, initially. As I dug deeper, I saw some other features I really liked, such as ZFS snapshotting and replication.

I drafted an “I want these features” list:

  • File-based (NFS) storage
  • KISS architecture; no heartbeat or other HA functionality inside the virtual storage appliances (because there’s no shared storage between ‘controllers’ or Nexenta instances; each VSA has its own set of drives)
  • Data availability using asynchronous filesystem-based snapshots and replication
  • Data performance using SSDs as cache, via a transparent tiering method
  • Data recoverability using Veeam Backup & Replication (which is installed offsite) and optionally cloud-based storage buckets (like Amazon S3, Google Cloud Storage, or any remote and publicly available Nexenta box)
  • Maintain data integrity (using ZFS features like checksumming); I’ve had data corruption due to power failure on the P4000 platform before.
  • Optional integration with the vSphere platform (for semi-object based storage) and VAAI-NAS optimizations
  • “Local-only” data transport

In the next couple of posts, I will dive into these bullet points, explaining the possibilities, how I implemented them and why I did so. I’ll start off with the first four bullets:

Passthrough storage controller, VMDirectPath and NFS

I wanted the layer cake of filesystems to be as small as possible. Previously, I had [insert VM Guest OS filesystem] in a VMDK, on a VMFS, in an iSCSI container and/or VSA filesystem in a VMDK on a VMFS on a RAID set on physical disks. If anything went wrong, there was just no way of salvaging the data. I needed that to change.

I flashed my physical RAID controller (a Dell PERC H100) to LSI-branded IT firmware (read SovKing’s post if you want to know how), stripping the controller of its RAID functionality and making it a passthrough adapter (or HBA). Next, I attached this controller directly to the Nexenta virtual machine using Intel’s VT-d and VMware VMDirectPath. Besides slashing a couple of layers in the filesystem cake, it promises better performance than VMDKs or RDMs.
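
As a quick sanity check (device names and output will obviously differ per setup), you can verify on the ESXi host that the LSI-branded controller is there, and inside the Nexenta VM that the guest now sees the physical disks directly:

    # On the ESXi host: list PCI devices and look for the LSI-branded controller
    esxcli hardware pci list | grep -i lsi

    # Inside the (illumos-based) Nexenta VM: list the disks the guest sees directly
    echo | format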

Next, I configured the required storage options inside Nexenta (volumes and shares) to present the storage as an NFS export instead of iSCSI. This slashes yet another layer: by using NFS, the zvol layer (in Nexenta-speak) is omitted.
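
Whether you do this through the Nexenta GUI or at the command line, it boils down to plain ZFS plus an NFS mount on the ESXi side. A rough sketch with hypothetical names (‘tank’ for the volume, ‘vmstore’ for the folder, a made-up VSA IP address):

    # On the Nexenta VSA: create a folder on the volume and share it over NFS
    zfs create tank/vmstore
    zfs set sharenfs=on tank/vmstore

    # On each ESXi host: mount the export as an NFS datastore
    # (NexentaStor mounts its volumes under /volumes)
    esxcli storage nfs add --host=10.0.0.10 --share=/volumes/tank/vmstore --volume-name=nexenta01-nfs01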

Instead of eight layers, I now only have [insert Guest OS filesystem] in a VMDK via NFS on ZFS on physical disks, which makes four layers. This should improve both recoverability and performance.

No highly available VSAs

With the P4000s, I had a Multi-Site Cluster with the Failover Manager. This was done to increase availability of the data volumes in case one of the virtual storage appliances failed. In practice, I’ve actually had more downtime because of the FOM failing or misinterpreting the network and isolation state of the storage nodes. I wanted to change this to a scenario where there’s no ‘complexity’ inside the storage appliance. If it’s down, it’s down. The other node will happily continue to serve its slice of the complete data set, but will not try to bring the failed volume(s) online. Bringing those failed volumes back online is a manual action. Since this environment doesn’t host anything mission critical (other than my blog), that’s OK.

Also, there’s no shared storage between ‘controllers’ or Nexenta instances; each VSA has its own set of drives (the physical drives inside the Dell PowerEdge R310 boxes). There’s no JBOD / SAS link between the drives; they’re internal drives connected to the Dell PERC H100 storage adapter. Because of this, an HA plugin wouldn’t actually add anything.

The only ‘high availability feature’ I’m using for the Nexenta VSAs is the Virtual Machine Automatic Startup feature in ESXi, to make sure the VSA boots immediately after the physical host boots. I have placed the Nexenta VM and VMDK on a separate 2.5″ SATA disk formatted with VMFS-5, which contains the VMs required to boot the environment and make it accessible from the outside (thus, the two Nexenta VSAs and the two virtual Vyatta routers).

Data Availability

Data availability in this context means ‘have the data cross-replicated’, so any instance of the data (measured in individual virtual machines) is available on both boxes. I wanted to go from synchronous replication on the P4000 (Network RAID-10) to asynchronous replication to improve performance. I only got around 80 IOPS on the Network RAID-10 volume with the P4000s, and that was just not enough. Since we’re using ZFS, I wanted the asynchronous replication to be based on ZFS snapshots. Nexenta offers just that with the Auto-Sync plugin. In a later post, I will share how I configured data replication between the two boxes.
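
Auto-Sync itself is configured inside Nexenta, but conceptually it is nothing more than scheduled ZFS snapshots shipped to the other box. A hand-rolled equivalent (hypothetical dataset, snapshot and host names) would look something like:

    # Take a new snapshot on the source VSA
    zfs snapshot tank/vmstore@replica-20121201-0300

    # Send only the delta since the previous snapshot to the second VSA
    zfs send -i tank/vmstore@replica-20121130-0300 tank/vmstore@replica-20121201-0300 | \
        ssh nexenta02 zfs receive -F tank/vmstore-replica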

Data Performance using cache

I have recently upgraded the home lab and laptop to a couple of bigger SSDs. I now have two Intel X25-M 80GB SSDs just sitting there. I figured, why not use them for KLAUWD.com?

Deciding how to use the SSDs was possibly the hardest part of the Nexenta build. I have a unique setup: two “low vRAM” Nexenta VSAs with more writes than reads. I monitored the read/write IOPS mix in my environment for a month or two; strikingly, it comes out at 25% read and 75% write. I didn’t expect that at all, but it is what it is.
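
There are several ways to watch that mix; one quick way, straight from the VSA itself (assuming a pool named ‘tank’), is:

    # Per-vdev read/write operations and bandwidth, sampled every 5 seconds
    zpool iostat -v tank 5

    # NFS server-side operation counters on the appliance
    nfsstat -s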

This means I want to use the SSD as a dedicated ZIL log device (SLOG), all the more because 100% of those writes appear to be synchronous. I believe the ESXi NFS client forces synchronous writes, but I cannot confirm that. Can anyone post a comment explaining it or linking to an explanation and background information? Thanks!

Anyway, with the SSD configured as a ZIL log device, I hope to boost write performance to somewhere between 500 and 1,000 IOPS. Anything above that is just awesome and will make the platform even more robust and scalable.
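
At the ZFS level, adding the SSD as a dedicated log device is a one-liner (the device name below is a placeholder for whatever the passed-through SSD shows up as):

    # Attach the SSD to the pool as a separate ZIL log device (SLOG)
    zpool add tank log c2t4d0

    # Verify the 'logs' vdev is now part of the pool
    zpool status tank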

With only 4GB of RAM per Nexenta VSA, I might have as much as 3GB available for ARC. With a working set of about 100-150GB (out of a total of 350GB of data), that means my read cache is only 1.5-3% of the working set size. That’s not a lot. I will monitor ARC performance over the next couple of weeks to determine the impact of such a small ARC. I can’t realistically add a second SSD to be used as L2ARC (the R310 simply can’t take any more drives), but I can bump up the amount of vRAM for the VSAs. We’ll just have to see how this one works out.
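
To keep an eye on that small ARC, the illumos kernel statistics on the appliance are good enough; the hit ratio is simply hits / (hits + misses):

    # Current ARC size and hit/miss counters
    kstat -p zfs:0:arcstats:size zfs:0:arcstats:hits zfs:0:arcstats:misses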

The three 3.5″ SATA disks inside the drive bays of the Dell PowerEdge R310 are placed in a RAID-Z1 (single parity) storage pool. I really wanted to create a mirror, but with a 250GB, a 500GB and a 1.5TB disk, I would only be left with 244GB of storage. With RAID-Z1, I would get 490 GB, which is enough to facilitate the current data set.
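
Creating that pool is a single command at the ZFS level (device names below are placeholders):

    # One RAID-Z1 vdev across the three mixed-size SATA disks;
    # usable space ends up being roughly two times the smallest disk
    zpool create tank raidz1 c1t0d0 c1t1d0 c1t2d0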

Since the Nexenta VM will run on the same hardware as the regular VMs, I tried to keep resource usage to a minimum. I disabled compression and deduplication on all volumes and folders, and set an 8KB record/block size. I also set the log bias to ‘latency’ and disabled access time (atime) updates.
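
These are all plain ZFS dataset properties, so at the command line the same tuning would look like this (again with the hypothetical ‘tank/vmstore’ dataset):

    zfs set compression=off tank/vmstore   # save CPU cycles on the shared host
    zfs set dedup=off       tank/vmstore   # dedup would eat the little RAM the VSA has
    zfs set recordsize=8K   tank/vmstore   # small record size for the VM workload
    zfs set logbias=latency tank/vmstore   # push synchronous writes through the ZIL/SLOG
    zfs set atime=off       tank/vmstore   # no access-time updates on every read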

Concluding

Using a couple of tricks available to virtualized Nexenta boxes (IT firmware turning the RAID controller into a passthrough storage adapter, attached via VMDirectPath), I was able to create a directly attached disk system, exported over NFS, with the smallest filesystem layer cake possible. Without any physically shared disk systems (JBODs), there’s no need for a high availability plugin, keeping the design simple and the complexity low. Using Nexenta’s Auto-Sync plugin, I cross-replicated the data to enhance data availability and remove the dependency of slices of data (individual virtual machines) on a single physical box. Lastly, I configured the SSDs as dedicated ZIL log devices to cope with the 75% writes, and I use the vRAM I have available as ARC to improve read performance. The RAID-Z1 pool with three SATA drives should provide enough capacity for all the virtual machines.

In the next post

In the next post, I’ll dive into why I chose ZFS over any other file system, data recoverability with a two-fold approach, and integration with the vSphere platform to optimize performance. I’ll also dive deeper into asynchronous replication with Nexenta’s Auto-Sync feature.