So today, Nutanix has released version 4 of its ‘Nutanix OS’ (NOS). NOS is the heart of the Nutanix solution, and is the software that enables all the goodies. It’s what makes Nutanix, well, Nutanix, and does all the distributin’ and all that.

So, what’s new?

Let’s do a play-by-play of new and improved features.

Better Performance, Availability and Scale

First up, the data fabric (or data services).

Integrated Data Protection

This feature is all about snapshots and replication. Nutanix has done both for a while now, but has now brought these features into PRISM, allowing the admin to schedule, manage and use local and remote snapshots and replication for backup and disaster recovery. Nice little additions are the GUI-based snapshot schedule and snapshot retention policy.
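
To make the schedule and retention idea concrete, here is a minimal sketch of how a fixed-interval schedule and a "keep the newest N" retention policy interact. This is not how PRISM implements it; the function names, the six-hour interval and the retention count are all made up for illustration.

```python
from datetime import datetime, timedelta

def snapshot_times(start, interval, count):
    """Return the times at which a fixed-interval schedule would take snapshots."""
    return [start + i * interval for i in range(count)]

def apply_retention(snapshots, keep_last):
    """Split snapshot timestamps into (kept, expired), keeping the newest keep_last."""
    ordered = sorted(snapshots)
    if keep_last <= 0:
        return [], ordered
    return ordered[-keep_last:], ordered[:-keep_last]

# Hypothetical policy: a snapshot every 6 hours, retain the 4 most recent.
taken = snapshot_times(datetime(2014, 4, 15, 0, 0), timedelta(hours=6), 6)
kept, expired = apply_retention(taken, keep_last=4)
print(len(kept), "kept,", len(expired), "expired")
```

The same pruning step would apply to a remote site, just with its own retention count.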

MapReduce Data Deduplication

In addition to real-time (and performance-tier focussed) data deduplication, NOS v4 introduces post-process deduplication for the capacity tier. Just like most, if not all, Nutanix data services, deduplication is distributed across all nodes. This feature is one of the best from a buying position: it makes your Nutanix investment more worthwhile with better effective space utilization, better VM density and better ROI.
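
As a rough illustration of what "MapReduce" post-process deduplication means, the sketch below fingerprints fixed-size chunks per node (the map step) and then groups identical fingerprints across nodes (the reduce step). It is a toy model under my own assumptions, not Nutanix's implementation: the chunk size, the use of SHA-1 and the node names are chosen purely for the example.

```python
import hashlib
from collections import defaultdict

CHUNK_SIZE = 4096  # hypothetical chunk size, chosen only for this example

def map_phase(node_name, data):
    """Emit (fingerprint, (node, offset)) for every fixed-size chunk on one node."""
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        yield hashlib.sha1(chunk).hexdigest(), (node_name, offset)

def reduce_phase(pairs):
    """Group locations by fingerprint; keep only fingerprints seen more than once."""
    locations = defaultdict(list)
    for fingerprint, location in pairs:
        locations[fingerprint].append(location)
    return {fp: locs for fp, locs in locations.items() if len(locs) > 1}

# Two nodes that each hold one copy of the same chunk (think: the same OS image block).
shared = b"A" * CHUNK_SIZE
node_data = {
    "node-1": shared + b"B" * CHUNK_SIZE,
    "node-2": b"C" * CHUNK_SIZE + shared,
}
pairs = [pair for name, data in node_data.items() for pair in map_phase(name, data)]
duplicates = reduce_phase(pairs)

# Every extra copy of a duplicated chunk is space that could be reclaimed.
saved = sum((len(locs) - 1) * CHUNK_SIZE for locs in duplicates.values())
print("reclaimable bytes:", saved)
```

The map step runs independently on each node, which is what lets the work be spread across the whole cluster.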

This nicely adds to the support for VDI environments, where you can now take two separate routes for managing storage:

  1. VAAI / VCAI Clones + Nutanix Shadow Clones
  2. Full Clones + Deduplication

Tunable Redundancy

This marks the release of a user-configurable level of fault tolerance on a per-VM basis within a single cluster. Since it’s not tied to any physical layer, it’s easy (and supported) to migrate between different Replication Factors on the fly. I’m not sure if the redundancy can be integrated with fault domains for block, rack or site awareness.
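
The trade-off behind picking a Replication Factor is simple arithmetic: more copies survive more failures but cost more raw capacity. A back-of-the-envelope sketch, with a made-up 24 TB of raw capacity and ignoring metadata, reserves and any savings from deduplication:

```python
def tolerated_failures(replication_factor):
    """With RF copies of each piece of data, RF - 1 copies can be lost."""
    return replication_factor - 1

def usable_capacity_tb(raw_tb, replication_factor):
    """Rough usable capacity, ignoring metadata, reserves and data reduction."""
    return raw_tb / replication_factor

RAW_TB = 24.0  # hypothetical raw capacity, for illustration only
for rf in (2, 3):
    print(f"RF{rf}: survives {tolerated_failures(rf)} failure(s), "
          f"about {usable_capacity_tb(RAW_TB, rf):.1f} TB usable")
```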

Availability Domains

Availability Domains (Failure Domain Awareness)
Also known as ‘Block Fault Tolerance’ or ‘Rack-able Unit Fault Tolerance’, the availability domain feature adds the concept of block awareness to Nutanix cluster deployments. It works by managing the placement of data and metadata in the cluster, ensuring that no two replicas of the same data are stored in the same Nutanix block, for high availability purposes.
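
The placement rule described above can be sketched in a few lines: every replica of a piece of data has to land in a different block. This is only an illustration of the constraint, not the actual NOS placement logic; the cluster layout, the names and the "first node in each block" choice are assumptions made for the example.

```python
def place_replicas(nodes_by_block, replication_factor):
    """Pick one node from each of RF distinct blocks, or raise if too few blocks exist."""
    # Only blocks that actually contain nodes are candidates.
    candidates = {block: nodes for block, nodes in nodes_by_block.items() if nodes}
    if replication_factor > len(candidates):
        raise ValueError("not enough blocks for block-aware placement")
    # Deterministic toy choice; a real system would weigh free space and load.
    chosen_blocks = sorted(candidates)[:replication_factor]
    return [(block, candidates[block][0]) for block in chosen_blocks]

# Hypothetical cluster of three blocks with a varying number of nodes.
cluster = {
    "block-A": ["A1", "A2", "A3", "A4"],
    "block-B": ["B1", "B2"],
    "block-C": ["C1"],
}
placement = place_replicas(cluster, replication_factor=2)
print(placement)
```

With fewer populated blocks than replicas the function refuses to place data, which mirrors the idea that block awareness needs enough blocks to spread the copies over.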

Availability Domains add the notion of a fault tolerance domain to various types of objects in the Nutanix inventory, specifically nodes and blocks. Dwayne Lessner explains it here: NOS 4.0 – When is it safe to upgrade the hypervisor?. Now I’m starting to wonder if these Availability Domains can be used to extend this awareness to entire sites, and thus create an (unsupported) stretched cluster… It would be good to see the new tunable Replication Factor integrated with the Availability Domain feature.
For me, Availability Domains is the single biggest feature in NOS v4! I happen to have had a block or two delivered for a project today, so I will be diving into this feature (in combination with Tunable Redundancy) in the next couple of weeks.

Powerful Management, Analytics and Automation

Next up, the management services:

PRISM Central

Not unlike Nimble Storage InfoSight, which I wrote about in ‘Why Nimble should open up InfoSight to the community (like vOpenData)’, Nutanix is starting to open up their phone-home system to their customers. For now, PRISM Central is a way to centrally manage clusters across locations from a single interface, aggregating health and usage data, providing single sign-on and simplifying workflows.

Smart Support
When enabled by the administrator, the smart support feature collects statistics from all the nodes in the cluster and sends a summary to Nutanix via email. This information is used for debugging and troubleshooting. In the future this data may also be used to auto-diagnose and alert administrators of possible misconfigurations or problems.

I’m guessing PRISM Central will include more and more monitoring, alerting and support integration in the future, transforming customer support into a unique selling point, just like Nimble did with Proactive Wellness. For now, Nutanix describes this as storing historical data for deeper analysis, with the aim of auto-diagnosing and alerting administrators to possible misconfigurations or problems. Yup, sounds just like InfoSight and Proactive Wellness…

Cluster Health

One of the things that could move up the stack into PRISM Central in future releases is Cluster Health. This tool identifies, troubleshoots and resolves various issues automatically. In a nutshell, it monitors VM, node and disk health.

In previous releases, this functionality was only available in the CLI and only ran on-demand. With NOS v4, this feature has been brought up into PRISM and runs in the background continually.

One-click NOS Upgrades

One other feature that made it into the PRISM GUI is the rolling cluster upgrade feature. The tool has been hardened and made more intelligent and more workflow-based, and now supports parallel upgrades (and serial reboots, for that matter). Of course, this is all non-disruptive and requires no manual intervention.

PowerShell cmdlets

Probably not the most sexy feature, but essential to have these days.

Other New Stuff

Finally, please check out the official launch page and official blog post. Andre Leibovici has a lot more new stuff to share:

Shadow Clones (Official Support)
Shadow Clones is finally out of tech-preview. Shadow Clones intelligently analyze the I/O access pattern at the storage layer to identify files shared in read-only mode (i.e. a Linked Clone Replica). When a 100% read-only disk is discovered, Nutanix will automatically create a snapshot at the storage layer on each Controller VM (CVM) and redirect all read I/O to the local copy, drastically improving end-user experience. Read more at Nutanix Shadow Clones Explained and Benchmarked.

Smart Pathing (CVM/AutoPathing 2.0)
The new and improved CVM AutoPathing 2.0 prevents performance loss during rolling upgrades, minimizing I/O timeouts by pre-emptively redirecting NFS traffic to other CVMs. Failover traffic is automatically load-balanced with the rest of the cluster based on node load.

My wish list for NOS.next

One of the biggest features absent from the list is stretched clustering support and vMSC certification. Now I know this feature is less attractive for U.S. customers, but for us Europeans it still is a killer feature! With Availability Domains and Tunable Redundancy, a (decent) first step has been taken, but I expect and hope this will be built out in the near future. I certainly will be diving into these two features!

A good second would be Veeam and Nutanix making a baby. Now, there are some public sightings of some pre-marital courting here and here, so here’s to hoping that we can make backups from Nutanix snapshots (and restore from them) in the near future.

As far as integration with other vendors goes, I’d like to see Nutanix jump on the hardware management bandwagon. It’s downright painful to manage all that physical hardware right now, and Nutanix could fix that by leveraging third-party tooling for firmware/BIOS management and configuration, preferably integrating it into the existing rolling upgrade workflow.