Back in November, I attended Storage Field Day 4, and CloudByte was one of the vendors presenting. I had put CloudByte on my to-do list to cover in a blog post, as their message of multi-tenancy and QoS grabbed my attention and I wanted to dive into their product and cover their technical solution. After months of testing (and changing focus as they themselves changed focus; more on that later), I finally got around to writing up my experiences with the product. And I’m glad I did, as they have some pretty interesting ideas built into their ElastiStor product.

What is ElastiStor?

CloudByte is founded on the principle that nobody, not even the storage admin, really knows how much storage performance a given application needs. The legacy approach of (massively) over-provisioning to prevent performance and capacity bottlenecks doesn’t cut it anymore. In an era of tight budgets, not least for the very competitive service provider market and enterprise test and development environments, over-dimensioning is not the most elegant (or cost-effective) way of solving the problem.

ElastiStor is FreeBSD- and ZFS-based and exposes the full ZFS feature set. ElastiStor OS is the storage software that runs all data services related to the physical infrastructure, such as HA clustering, replication and node coördination.

I highly recommend visiting their Documentation page and reading up on their product. It’s very well written and quite comprehensive. Also, recent CloudByte blog posts explain their offering pretty well: Give new dimensions to your Storage using CloudByte ElastiStor and You design the Hardware and CloudByte delivers it as an Intelligent Storage Box.

Hardware layer

When I started evaluating ElastiStor in early 2014, their delivery model was software-only. During Storage Field Day in November ’13, the delegates commented on the lack of bundled hardware. Since then, they’ve changed gears and now also offer hardware appliances: standard x86 hardware bundled with ElastiStor software, called the ESA. The 2U ESA box comes in three different flavors with different availability, capacity and performance characteristics.


The back-end storage is provided by (switched) SAS, SATA (although not recommended for HA Groups) or FC JBODs. By attaching multiple physical servers to a JBOD, the back-end storage can be made highly available using standard ZFS methods, with a limit of 4 nodes per HA Group. A subtle note here is that the number of HA Groups in an ElastiStor environment is virtually unlimited. While 4-node HA Groups are fairly limiting, the virtually unlimited number of HA Groups will allow it to scale. And because you can migrate data from HA Group to HA Group, I think there’s enough flexibility for this very early-stage product. I would like to see these limitations removed in an upcoming version, though.

General sizing guidelines and limitations as far as the ZFS pools go apply unchanged. This means that you’re limited by the JBODs, SAS fabrics, ZIL, L2ARC and node memory of the hardware you have available. ElastiStor can, however, query the back-end and indicate the maximum number of IOPS. This is only a best-effort guess based on some hardware characteristics and a couple of default metrics like read/write mix and random/sequential mix. Because of that, the value is very static and should only be used as a rough estimation. CloudByte recommends doing your own real-world performance tests to estimate IOPS and latency values for a disk pool.

These profiles can be changed to more realistic numbers after testing to better support specific workloads, and CloudByte recommends you change these numbers for SSD tiers after testing, since there is still a lot of variation in SSD performance. They do state that the numbers for SATA, Nearline SAS, SAS 10k and SAS 15k are pretty reliable.
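To make the “static profile” approach concrete, here’s a rough back-of-the-envelope IOPS estimator of my own. The per-disk numbers and the write-penalty formula are generic industry rules of thumb, not CloudByte’s actual heuristics, so treat the output the same way you’d treat ElastiStor’s initial guess: a starting point, not a guarantee.

```python
# Illustrative pool IOPS estimator in the spirit of the static profiles
# described above. Per-disk numbers and penalties are generic rules of thumb,
# NOT CloudByte's actual heuristics.

# Typical random IOPS per device (very rough defaults)
DISK_PROFILES = {
    "sata":   80,
    "nl_sas": 90,
    "sas10k": 140,
    "sas15k": 180,
    "ssd":    5000,   # varies wildly between models; test before trusting
}

# Approximate write penalty per ZFS vdev layout (a simplification)
WRITE_PENALTY = {
    "mirror": 2,
    "raidz1": 4,
    "raidz2": 6,
    "raidz3": 8,
}

def estimate_pool_iops(disk_type, disk_count, layout="raidz2", read_pct=0.7):
    """Best-effort guess of sustainable random IOPS for a pool."""
    raw = DISK_PROFILES[disk_type] * disk_count
    write_pct = 1.0 - read_pct
    # Reads hit every spindle; writes are amplified by the layout's penalty.
    return raw / (read_pct + write_pct * WRITE_PENALTY[layout])

if __name__ == "__main__":
    # 24 NL-SAS disks in RAIDZ2 with a 70/30 read/write mix -> ~864 IOPS
    print(round(estimate_pool_iops("nl_sas", 24, "raidz2", 0.7)))
```

A real-world load test against the actual pool will almost always beat a calculation like this, which is exactly why CloudByte tells you to run one.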

Software architecture

Their software architecture consists of two major components:

  • ElastiCenter for management and integration, which includes a REST API (see the scripting sketch after this list).
  • ElastiStor for data storage and data services, node and workload coordination.
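Since ElastiCenter exposes a REST API, most management actions can in principle be scripted. As a purely hypothetical illustration, a call could look roughly like the sketch below; the endpoint path, command name and parameters are placeholders of my own, so check CloudByte’s API documentation for the real ones.

```python
# Hypothetical example of scripting against a management REST API such as
# ElastiCenter's. The endpoint, command name and parameters are illustrative
# only -- consult CloudByte's API documentation for the actual interface.
import json
import urllib.parse
import urllib.request

ELASTICENTER = "https://elasticenter.example.com"   # placeholder address
API_KEY = "your-api-key-here"                        # placeholder credential

def api_call(command, **params):
    """Issue a GET-style API call and return the parsed JSON response."""
    query = urllib.parse.urlencode({
        "command": command,
        "apikey": API_KEY,
        "response": "json",
        **params,
    })
    with urllib.request.urlopen(f"{ELASTICENTER}/client/api?{query}") as resp:
        return json.loads(resp.read().decode())

if __name__ == "__main__":
    # Example: list tenant storage machines for one account (names hypothetical)
    print(api_call("listTsm", accountname="tenant-a"))
```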

The storage platform offers two differentiating features compared to other ZFS-based systems:

  1. Storage Quality-of-Service (QoS)
  2. Multi-tenancy and decoupling of back-end storage infrastructure from front-end storage consumers, which enables vMotion-like storage flexibility

These features aim ElastiStor squarely at the (Cloud) Service Provider and large Enterprise (although mainly test/dev) markets. ElastiStor’s offering is hard to compare to other ZFS-based storage solutions like Nexenta due to its added secret sauce. And really, the similarities end with the capabilities enabled by ZFS. The underlying OS, the storage objects and other constructs are hardly comparable. Although I found a lot of the underlying hardware concepts very similar, the higher I went into their stack, the more different it became.

There are two additional types of software that can be added to an ElastiStor OS instance. Add-ons extend the capabilities, while plug-ins interoperate with server- and hypervisor-centric software.

Add-on Utilities

Add-on Utilities provide either optional protocol support (such as Fibre Channel) or interoperability among ElastiStor nodes.

  • ElastiReplicate enables remote asynchronous replication between two ElastiStor nodes. This feature provides disaster recovery capabilities that extend the core capabilities of ElastiStor.
  • ElastiHA provides high availability groups of up to four nodes. While conventional high availability only protects against the failure of one node, ElastiHA protects against the failure of up to three nodes. This is a perfect match for ElastiStor’s RAIDZ-3 data protection, which survives the failure of up to three physical drives without losing data. ElastiHA can be controlled from both ElastiCenter and the REST API.
  • ElastiFC enables ElastiStor to act as a block-level Fibre Channel target device. With this add-on, ElastiStor’s storage resources (ZFS volumes, also called ZVOLs) are presented to the server as Fibre Channel disk(s). All functionality is available from both ElastiCenter and the REST API.

Plug-in Utilities

Plug-in Utilities provide integration and interoperability with virtualization environments, cloud stacks, and applications. The following plug-ins are available:

  • VMware vCenter plug-in allows admins to create and manage performance-selectable VMs in real time right from the vCenter console.
  • Citrix CloudStack plug-in allows storage provisioning and management in real time right from CloudStack.
  • OpenStack plug-in allows storage provisioning and management in real time right from OpenStack.
  • Microsoft SQL Server plug-in provides application-consistent snapshots for Microsoft SQL Server databases.

Differentiating Features

Let’s take a closer look at the features that set ElastiStor apart.

Multi-tenancy

CloudByte enables multi-tenancy by first separating the back-end storage infrastructure from the front-end storage protocols and components, and then enabling multiple accounts to access the front-end storage system (both via a storage protocol and via the management interface). It accomplishes this by addressing the various front-end components via different IP addresses and VLANs per tenant.

The back-end is related to physical storage and nodes and includes:

  1. Sites
  2. HA Groups
  3. Nodes (servers running ElastiStor with NICs and FC HBAs)
  4. Pools (consisting of shared storage, i.e. JBOD disk arrays configured for a ZFS software RAID level)
  5. Storage administrators (storage team managing the back-end storage infrastructure)

The front-end is related to the ‘users’ of the platform and the moving parts that make that happen (a small object-model sketch follows the list). It includes:

  1. Accounts (tenant, team, department, client, customer)
  2. Account administrators (storage admins managing the account: TSMs and Volumes)
  3. Tenant Storage Machines (QoS policies for storage performance, throughput and latency)
  4. Storage Volumes
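To keep these constructs straight, here’s a minimal sketch of my own of how they hang together. This is not CloudByte’s actual object model, just a mental picture: physical resources on one side, tenant-facing objects on the other, linked through the pool a TSM lives on.

```python
# A minimal, illustrative model of ElastiStor's back-end and front-end
# constructs (my own sketch, not CloudByte's object model).
from dataclasses import dataclass, field
from typing import List

# --- Back-end: physical infrastructure ---
@dataclass
class Node:
    name: str

@dataclass
class Pool:
    name: str
    raid_level: str          # e.g. "raidz2"
    iops_capacity: int       # estimated, see the earlier sketch

@dataclass
class HAGroup:
    name: str
    nodes: List[Node]        # at most 4 nodes per HA Group
    pools: List[Pool]

@dataclass
class Site:
    name: str
    ha_groups: List[HAGroup]

# --- Front-end: tenant-facing objects ---
@dataclass
class Volume:
    name: str
    protocol: str            # "nfs", "iscsi", "cifs", "fc"
    iops_limit: int

@dataclass
class TSM:
    name: str
    pool: Pool               # a TSM lives on exactly one pool at a time
    ip_address: str
    volumes: List[Volume] = field(default_factory=list)

@dataclass
class Account:
    name: str                # tenant, department, customer...
    tsms: List[TSM] = field(default_factory=list)

# Example: one tenant with a TSM on a RAIDZ2 pool
pool = Pool("pool01", "raidz2", iops_capacity=10000)
tenant = Account("tenant-a", tsms=[
    TSM("tsm01", pool, "10.0.0.42",
        volumes=[Volume("vm-datastore", "nfs", 2000)])
])
```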

The ZFS architecture in CloudByte’s product apparently isn’t a standard version. They claim to have changed how ZFS works to make it aware of whom the data belongs to. This is true for data stored on disk, but also for data in flight: the system knows which data packet belongs to which tenant. They call this resource tagging, and they’re able to tag about 40 individual types of resources. This tagging allows, for instance, isolating I/O in specific queues, and allows the product to control the QoS for about 85% to 90% of all I/O flowing around on the fabric.
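Conceptually, tenant-aware tagging and queueing could look something like the sketch below. This is my own simplified illustration of the general idea, not CloudByte’s implementation (which, as mentioned, tags around 40 resource types end to end).

```python
# Conceptual sketch of tenant-aware I/O tagging and per-tenant queueing.
# My own illustration of the general idea, not CloudByte's implementation.
from collections import defaultdict, deque
from dataclasses import dataclass

@dataclass
class IORequest:
    tenant_id: str     # the "tag" that travels with the request
    volume: str
    op: str            # "read" or "write"
    size_kb: int

class TenantAwareScheduler:
    """Routes tagged I/O into per-tenant queues and drains them by weight."""

    def __init__(self):
        self.queues = defaultdict(deque)   # tenant_id -> queue of requests
        self.weights = {}                  # tenant_id -> relative priority

    def submit(self, request: IORequest, weight: int = 1):
        self.weights.setdefault(request.tenant_id, weight)
        self.queues[request.tenant_id].append(request)

    def dispatch(self, budget: int):
        """Dispatch up to `budget` requests, proportionally to tenant weight."""
        dispatched = []
        total_weight = sum(self.weights[t] for t in self.queues if self.queues[t])
        for tenant, queue in self.queues.items():
            if not queue or total_weight == 0:
                continue
            share = max(1, budget * self.weights[tenant] // total_weight)
            for _ in range(min(share, len(queue))):
                dispatched.append(queue.popleft())
        return dispatched

if __name__ == "__main__":
    sched = TenantAwareScheduler()
    sched.submit(IORequest("tenant-a", "vol1", "read", 8), weight=3)
    sched.submit(IORequest("tenant-b", "vol7", "write", 64), weight=1)
    print(sched.dispatch(budget=4))
```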

Their claim is that on the levels below the TSM, you won’t be able to see whom the data actually belongs to. There’s no mention of on-disk or over-the-wire encryption. By carefully designing the physical and virtual (TSM) storage layers, a physical separation of tenant data is still possible (but that wouldn’t be the most efficient way to go about your design). This is actually quite flexible: you can do total physical separation (including controllers, JBODs, disks and physical network infrastructure), use a share-everything approach, or any design in between. This flexibility should really speak to service providers, where requirements are very unpredictable and can change with every new customer.

Guaranteeing Quality of Storage Performance

Given the fact that usually nobody, not even the application owner or the storage admin, really knows how much storage performance a given application needs, it’s hard to guarantee each consumer on a storage platform the resources they need and/or are paying for. Usually, this problem is circumvented by isolating and over-provisioning the underlying hardware in terms of storage capacity and performance. When dealing with large enterprises and their critical applications, this additional cost is usually accepted, since the risk and impact of I/O bottlenecks outweigh the cost. For test/dev applications and for the public Service Provider market, this additional cost is harder to swallow. In these environments, it has become more important for the storage admin to carve up the available (shared) resources for each tenant, whether it’s a department or an individual customer, and serve them with exactly the resources they’re entitled to.

Usually, this entitlement is what they’re paying for, and guaranteeing the availability of these resources in resource-restricted infrastructures is key to keeping the tenant satisfied. Of course, this works both ways, since this is how the Service Provider makes money, and they’d like to stay in business by not giving away (too many) free resources.

This QoS thing will only work cost-effectively if there are lower-priority workloads to fill up the queue and put the physical resources to good use. This model won’t work if an organization only runs Tier-1 applications or has no discernible relative difference between workloads. In such a case, dedicated physical resources are a better way to go. But with a mix of size, required performance and tier (which would be defined in SLA terms), the service provider or enterprise customer can optimize their use of the storage system (i.e. grow in small-ish steps), keep CapEx down and still be able to deliver performance to customers and meet the SLA for each individual tenant.

As far as storage performance goes, CloudByte doesn’t really add any new magic into the mix from a back-end perspective. They’re ZFS-based, and they rely on the ZIL, ARC and L2ARC for performance and caching. The usual ZFS design guidelines apply undiminished. From a front-end (storage consumer) perspective, they do add Storage QoS, which is a whole different ballgame.

Tenant Storage Machine

Each ElastiStor node runs a multi-tenant layer of ‘Tenant Storage Machines’. You could also call this a storage hypervisor or ‘software-defined storage’. These TSMs provide data services that are exposed to each tenant and its resources (hypervisors, VMs, apps, etc.). TSMs are fully fledged storage controllers, but also contain QoS controllers, in which you can define and manage storage performance on a per-tenant basis. Each TSM has its own IP address and port and is multiprotocol (NFS, iSCSI, CIFS).


TSMs are not bound to an underlying physical host. They cannot span multiple pools, however. A TSM can move around between pools, which means you can do a Storage vMotion-like migration of a TSM from one pool to another. A nice feature (which is derived from the ZFS layer underneath) is that the storage migration can have dissimilar source and target hardware, and that cache and workload history are moved along, too. Also, the migration can be scheduled to occur during off-peak hours.

A smaller sibling of the TSM, the Disaster Recovery TSM, provides ‘shadow’ services for tenant-level DR purposes on a different site using ZFS asynchronous replication.

Volumes are carved out of the TSMs and are tenant-specific. Volumes can be presented using a number of protocols like NFS (v3 and v4), CIFS, iSCSI and FC. The file-based protocol NFS allows for VM-level granularity with QoS from within the VMware vCenter Server using the integration plug-in.

QoS (specifically limits) is applied on a per-volume basis (but can be configured at the pool and TSM level, too) and comes in two flavors: hard limits and grace limits.

Grace is the provisioning of unused IOPS/throughput in Pools to Storage Volumes based on their performance requirements. You can configure Grace at the Pool level and the Storage Volume level. In essence, it lends IOPS from other TSMs or pools to volumes, as long as those have resources to spare.
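As an illustration of how grace behaves, here’s a deliberately simplified allocation sketch of my own: volumes are first capped at their hard limit, and only if the pool has spare IOPS do starved volumes get topped up. CloudByte’s actual grace algorithm isn’t public, so take the policy below as the general idea, not the implementation.

```python
# Illustrative sketch of "grace": handing unused pool IOPS to volumes that
# have hit their configured limit. The policy is my own simplification,
# not CloudByte's actual grace algorithm.
from dataclasses import dataclass
from typing import List

@dataclass
class VolumeQoS:
    name: str
    iops_limit: int      # hard per-volume limit
    iops_demand: int     # what the workload currently asks for

def allocate_with_grace(pool_iops: int, volumes: List[VolumeQoS], grace=True):
    """Satisfy demand up to each volume's limit, then (optionally)
    hand out leftover pool IOPS to volumes that are still starved."""
    allocation = {v.name: min(v.iops_demand, v.iops_limit) for v in volumes}
    spare = pool_iops - sum(allocation.values())
    if grace and spare > 0:
        starved = [v for v in volumes if v.iops_demand > v.iops_limit]
        for v in starved:
            extra = min(v.iops_demand - v.iops_limit, spare)
            allocation[v.name] += extra
            spare -= extra
            if spare <= 0:
                break
    return allocation

if __name__ == "__main__":
    vols = [VolumeQoS("db01", 2000, 3500), VolumeQoS("web01", 1000, 400)]
    print(allocate_with_grace(10000, vols))          # grace tops db01 up to 3500
    print(allocate_with_grace(10000, vols, False))   # hard limit caps db01 at 2000
```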

Using limits in any quality-of-service scenario is kinda weird, though. QoS should be all about guaranteeing minimum performance, not limiting it. But by limiting other tenants’ performance, a higher-priority tenant is given (but not specifically guaranteed) more performance.
If a limit is reached, obviously the application wants (and/or needs) more performance. If the application is important, it should’ve gotten more performance than the limit dictates (or even more than the physical box can deliver, since ElastiStor has a vMotion-like storage feature).

If the application isn’t all that important, it could’ve gotten all that performance if, and only if, no higher-priority applications have a demand that depletes the physical limit; in other words, if the physical pool backing that volume has spare performance. If not, it should be throttled.

The problem here is that limits alone are way too static; ideally, we’d have a more dynamic system with relative shares, like VMware’s DRS, and have the system take the actual physical limits into account. The relations between the tenants are what matters here, since there are always going to be physical performance limits. Given these limits, the TSMs try to do QoS by way of a self-tuning dynamic optimization, assigning node resources like processor cycles, memory, networking and storage bandwidth. Much like DRS, the interface provides an initial placement recommendation for newly created TSMs.
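To show what I mean by reservations and shares, here’s a small sketch of the model I’d prefer. It’s entirely my own, in the spirit of DRS, not something ElastiStor does today: admit every tenant’s guaranteed minimum first, then split whatever the pool has left over proportionally to shares.

```python
# Sketch of a reservation + shares model (my own illustration, DRS-style),
# as opposed to QoS by limits alone.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TenantQoS:
    name: str
    reservation: int   # guaranteed minimum IOPS
    shares: int        # relative weight for dividing what's left over

def allocate(pool_iops: int, tenants: List[TenantQoS]) -> Dict[str, int]:
    """Admit reservations first, then split the remainder by shares."""
    reserved = sum(t.reservation for t in tenants)
    if reserved > pool_iops:
        raise ValueError("Overcommitted: reservations exceed pool capacity")
    allocation = {t.name: t.reservation for t in tenants}
    leftover = pool_iops - reserved
    total_shares = sum(t.shares for t in tenants)
    for t in tenants:
        allocation[t.name] += leftover * t.shares // total_shares
    return allocation

if __name__ == "__main__":
    tenants = [
        TenantQoS("gold",   reservation=4000, shares=4),
        TenantQoS("silver", reservation=2000, shares=2),
        TenantQoS("bronze", reservation=500,  shares=1),
    ]
    # {'gold': 6000, 'silver': 3000, 'bronze': 1000} on a 10,000 IOPS pool
    print(allocate(10000, tenants))
```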

The heuristics in the product start with multiple assumptions about the application’s or tenant’s I/O pattern, and every minute these are improved based on real-world data from the app as well as the available physical resources. This data is stored inside ElastiCenter. This means the volumes in a TSM are stateful, and all of this QoS and caching data is preserved. This also helps to give an I/O performance approximation for any volume on the target storage system.
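The “start with an assumption, refine it every minute” principle is easy to picture with something like an exponentially weighted moving average. CloudByte hasn’t published its actual heuristics, so the snippet below only illustrates the general mechanism.

```python
# Tiny illustration of refining an initial assumption with per-minute
# observations (an EWMA). Not CloudByte's actual heuristic.
def refine_estimate(current_estimate: float, observed: float, alpha: float = 0.2) -> float:
    """Blend the running estimate with the latest one-minute observation."""
    return (1 - alpha) * current_estimate + alpha * observed

if __name__ == "__main__":
    estimate = 5000.0                                  # initial assumption (IOPS)
    for observed in [3200, 3400, 3100, 3300, 3250]:    # per-minute samples
        estimate = refine_estimate(estimate, observed)
        print(round(estimate))                         # converges toward reality
```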

All in all, the QoS feature isn’t a unicorn or other magical creature. You still need to do the math and check that the sum of the minimum performance promised to all tenants stays below what the physical environment can deliver.

Concluding

They’ve taken the easy route by leveraging existing pieces where possible, like the OS (FreeBSD), filesystem (ZFS) and hardware appliances. This allows them to focus on true innovation, such as their Tenant Storage Machine, which enables multi-tenancy and QoS. Relying on existing technologies can hold them back, though. ZFS is well known for its reliability, but architecturally it might not be ready for distributed storage, stringent QoS or multi-tenancy. There’s always a risk of running into snags when using older, existing technology. I’m very curious whether CloudByte will ever hit the disruptive upgrade issues we’ve been talking about recently due to their early design choices. I’m guessing the lack of true tiering in ZFS is already a nuisance for the CloudByte team, and I’m sure there are other limitations the team is having a hard time overcoming.

The platform has many of the popular features any startup should have these days, but it still lacks the rich set of data services you might find in other storage startups. They’re ticking a lot of the feature-comparison boxes, but some feel outdated or incomplete. For instance: they leverage ZFS’s compression and deduplication, both of which have their fair share of strong opponents and practical limitations, and they’re lacking a true tiering solution.

Both differentiating features, QoS and multi-tenancy, are well thought out, but they have some way to go in improving them. Specifically, I think their TSM concept can be developed further towards better granularity (VMDK, vVols) and visibility into individual virtual machines. Also, like with DRS, the load balancing of performance and capacity across multiple HA Groups should be automatic.
I do think the QoS concept needs a large overhaul, specifically by adding resource reservations in addition to limits. Reservations are much more relevant to both the customer (it’s a better guarantee, after all) and the service provider (since reservations are easier to sell).

Also, the performance measurement on the back-end of the system is very static at this point. The values are derived from industry-standard performance metrics and aren’t tested against the actual storage pool. Instead of the administrator having to do load tests before the array goes into production, I’d suggest CloudByte integrate such testing into the platform. Maybe something similar to Nutanix’s diagnostics.py? It would certainly take out the guesswork and make this feature align better with reality.

Looking at their current state of development, I think CloudByte has done a good job in the limited time they’ve been generally available. All in all, I like their QoS and multi-tenancy concepts enough to veer to the positive side and be optimistic about their product and future. Whichever way they go, I’m very interested in seeing their technology develop and mature over time.