August 2021 sprints

August 2nd sprint:
Following the announcement around NX-G8, Sizer also went live with the NX-G8 (3rd Gen Intel Xeon processors / Ice Lake) in this sprint. Below are the highlights:
NX-G8
  • Available for end-to-end sizing and FL quoting
  • NX-1065-G8 on the lower end with more core/CPU options
  • NX-3060-G8 and NX-3070-G8 with All Flash/NVMe builds
  • NX-8055-G8 on the higher end with both Hybrid and All Flash (including NVMe)
  • NX-3055G-G8 with the GPUs
  • The larger All Flash systems NX-8150-G8 and NX-8170-G8 to follow in a few weeks
  • Both G7 and G8 will be available for now (either may come up in recommendations depending on the workload)
Others
  • CVM cores – allocation now accounts for physical cores going forward, translated into specInt-adjusted cores for the sizing stats
  • Solution option – added Prism Ultimate option (both core- and node-based licensing)
More details on the launch announcements for NX-G8 and spec sheet links here: https://www.nutanix.com/blog/nutanix-launches-support-for-next-gen-of-platforms-for-hybrid-cloud-deployments
August 1st sprint:
We went live with the current sprint with a major change around thresholds; you may have already noticed the banner displayed in the UI today. More details below.
Thresholds in Sizer:
  • The default threshold in Sizer used to be 95%, which was also the maximum allowed
  • With this release, the defaults are adjusted to 85%, while the maximum continues to be 95%
  • While the default is changed to 85%, users can still go to Settings and change it back to 95%
  • We feel it is more prudent to go with 85% as it leaves some room for estimation errors/spikes and for upgrades (a node is taken down, so the cluster runs at N+0), while still allowing 95% if needed – for example for non-critical clusters or off-peak upgrades
  • This is also consistent with our observation of the utilization % for manual sizings (which had more buffer than the corresponding auto sizings – possibly for the same reasons)
  • This applies to new sizings done starting today. Existing sizings continue with their previous thresholds (clones are treated as new, so the new thresholds apply)
Other enhancements:
Usability
  • Core-based licensing for Prism/Flow
  • AOS rule on minimum flash at 4% of total node capacity (also applicable for Files/Objects)
Platforms
  • NX memory replacement from 2933 MHz to 3200 MHz
  • NEC: new CL 4LFF and 12LFF platform models
  • New Platform: Lenovo HX7820 2 socket variant
More explanation behind the threshold changes and rationale in the Sizer wiki: https://sizer.nutanix.com/#/help/articles/1035

August 2021 release (CVM explanation)

CVM cores – What has changed and Why?

How does Sizer assign CVM cores up until today (Aug 2021)?

Sizer allocates resources to CVM as part of the sizing exercise.  Here, we will be looking at CVM cores specifically.

Sizer allocates CVM cores based on a combination of factors – workload type (e.g., 4 for Server Virtualization, 12 for Databases), node type (e.g., higher CVM cores for NVMe nodes), or guidance on certain features (e.g., rules around Async/NearSync).

However, while attributing cores to the CVM, Sizer used ‘effective cores’, meaning they were specInt-adjusted cores and not the actual physical cores available to the CVMs.

For example:

Let's say Sizer allocated 4 cores to the CVM. These are ‘effective cores’, which are specInt-adjusted.

As seen in the sizing stats table below, that many ‘effective cores’ are attributed to the CVMs.

7 nodes and 4 CVM cores per node: 7 x 4 = 28 cores (effective cores)

Let's say the recommended node had the Gold 5220 CPU.

So, translating the effective cores to physical cores:

28 ‘effective cores’ is approximately 22.4 physical cores (adjusting for the specInt rating of the Gold 5220)

22.4 / 7 = 3.2 physical cores per CVM

So roughly 3 physical cores per CVM, which is on the lower side and can cause performance issues.

As CPUs get better in performance (Ice Lake > Cascade Lake > Skylake), their specInt ratings are higher, and thus the same number of effective cores translates into even fewer physical cores.
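
A minimal Python sketch of the old behavior described above (illustrative only – the specInt factor is a hypothetical value chosen so that 28 effective cores map to roughly 22.4 physical cores, as in the example; this is not Sizer's actual code):

    # Old behavior: CVM cores were allocated as specInt-adjusted "effective cores"
    # and therefore mapped to fewer underlying physical cores.
    SPECINT_FACTOR = 1.25   # hypothetical per-core adjustment for a Gold 5220

    nodes = 7
    cvm_effective_cores_per_node = 4

    total_effective = nodes * cvm_effective_cores_per_node    # 28 effective cores
    total_physical = total_effective / SPECINT_FACTOR         # ~22.4 physical cores
    physical_per_cvm = total_physical / nodes                 # ~3.2 physical cores per CVM
    print(total_effective, round(total_physical, 1), round(physical_per_cvm, 1))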

What has changed?

The CVM core allocation is now based on physical cores (and not effective cores).

So, when Sizer assigns 4 cores, it is 4 physical cores (of the Gold 5220 in the above example) and not effective cores.

Following up from the previous example (refer to the image below):

The tooltip shows the total physical cores assigned to the CVM: 7 x 4 = 28 cores

For the rest of the calculations in the sizing stats, these are converted to effective cores, so 28 physical cores = 33.5 ‘effective cores’.

Note: Depending on the processor, the effective cores value (shown in red here as 33.15) can be as much as 50-60% higher than the physical core count (for example for high-end Cascade Lake refresh or high-end Ice Lake processors), further validating the point that the CVM would otherwise be getting fewer underlying physical cores (and hence the change).
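
A minimal sketch of the new behavior (again illustrative only; the specInt factor of roughly 1.2 is a hypothetical value consistent with 28 physical cores showing as about 33.5 effective cores in the stats):

    # New behavior: CVM cores are allocated as physical cores; the conversion to
    # effective cores is done only for the sizing stats display.
    SPECINT_FACTOR = 1.2    # hypothetical per-core adjustment for the chosen CPU

    nodes = 7
    cvm_physical_cores_per_node = 4

    total_physical = nodes * cvm_physical_cores_per_node    # 28 physical cores reserved
    total_effective = total_physical * SPECINT_FACTOR       # ~33.6 effective cores in the stats
    print(total_physical, round(total_effective, 1))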

What is the impact to sizing as a result of this change:

The sizings are more robust: the CVM allocation now removes potential CVM performance bottlenecks while aligning with Foundation recommendations.

With more high-end processors having significantly higher specInt ratings (Cascade Lake refresh and now Ice Lake), the gap between effective cores and physical cores is getting wider. This change ensures that while UVMs take advantage of the better processor capabilities, the CVM gets the cores it requires for optimal performance and does not become a source of latency issues.

Understandably, this change will increase the core requirement for the cluster compared to previous sizings. For an existing sizing this is most visible when cloning, since the new CVM minimums are applied to the cloned scenario, leading to increased utilization on the CPU dials. This is the tradeoff: a higher CVM allocation in exchange for a more robust, performance-optimal sizing.

July 2021 sprints (Thresholds)

Sizer Thresholds – What has changed and Why?

 

What are thresholds in Sizer? 

Sizer has a feature called thresholds. These are defined individually for each of the sizing resources – cores, memory, SSDs, HDDs and GPUs (wherever applicable). The thresholds ensure that the total available resources in the nodes (cluster) are sufficient to meet the workload requirements while also leaving some buffer for unforeseen surges in the workload applications.

What has changed in thresholds?

Up until July 2021, the threshold defaults across cores/memory/SSD/HDD used to be 95%, as can be seen (and modified) on the policy screen shown below.

Note that the default was set to 95%, which is also the maximum allowed. Users could go for a lower threshold (a more conservative sizing with more buffer for future spikes). However, under no circumstances did Sizer allow going higher than 95%, so as to preserve a 5% margin for accommodating sizing errors/estimates and workload usage uncertainties.

Starting August 2021, Sizer changes the defaults for these thresholds to 85% across ALL resources (cores/memory/SSDs/HDDs), as shown below.

Note that the defaults have moved left to 85%; however, the maximum allowable utilization of the cluster resources remains at 95%.
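
A minimal sketch of what the threshold means in practice (illustrative numbers only, not Sizer's actual code): the threshold caps the share of each cluster resource the workload is allowed to consume, so moving the default from 95% to 85% leaves a larger buffer.

    # Threshold = fraction of a cluster resource that may be consumed by the workload.
    def usable(total, threshold):
        return total * threshold

    total_cores = 200                     # hypothetical cluster-wide core count
    print(usable(total_cores, 0.95))      # old default: 190 cores usable
    print(usable(total_cores, 0.85))      # new default: 170 cores usable
    # The same workload therefore shows a higher utilization % against the 85% default.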

Why?

Why the change?

Having the maximum allowable and the default both at 95% at times did not provide enough margin for sizing estimate errors or unforeseen workload usage spikes, as only 5% was left. Given that making accurate estimates is hard, we felt it was prudent to provide more slack with an 85% threshold.

To be clear though, many sizings have been done successfully at the old 95% level. This move was also supported by Sizer users doing manual sizings, who often opted for more slack. The change was made out of prudence rather than in response to any particular sizing issue.

When is it best to leave it at the 85% threshold?

We feel this is the more prudent level for most sizings. It allows more room for estimate errors and, for that matter, customer growth.

When might it be fine to go to the 95% threshold?

Certainly, numerous sizings have been done with the 95% threshold and customers were happy. We still allow 95% as the threshold. These are N+0 thresholds, so at N+1 there is a lot more slack. The 95% level is hit when one node is taken offline, for example for upgrades. If the customer does upgrades during off-hours, core and RAM requirements are a lot lower than normal and do not hit the higher threshold anyway. Again, we feel it is more prudent to leave it at 85%; going higher just means you need to be comfortable with your sizing estimates, especially when the cluster is at N+0 (during an upgrade).

What are the implications for existing sizings?

First, the new sizings:

All new sizings (effective 9th August 2021) will have default thresholds at 85%. And since it is a significant change which impacts ALL new sizings and ALL users (internal/partners/customers), a BANNER will be displayed prominently for two weeks for general awareness.

 

Implications for existing sizings:

There will be NO impact on sizings created before 9th August 2021. Existing sizings continue with the default threshold of 95% and calculate the utilization percentages, N+0/N+1 status, etc. based on that previous default. Thus, there won't be any resizing or new recommendation for existing sizings; those sizings and their recommendations remain valid for their scenarios.

Cloning an existing scenario: 

Cloning an existing sizing will be treated as a new sizing created after 9th August 2021, and thus the new sizing rules and default thresholds will apply.

One implication is that utilization percentages across the cluster resources will increase. This is because only 85% of the resources are now considered available for running the workload, as against 95% earlier. This reservation of an additional 10% of resources may drive a higher node count (or turn an existing N+1 solution into N+0) in some edge cases.

The user can choose to resize for the new defaults, which may lead to a higher node or core count – but that is for the better, as explained above, providing margin for estimation errors and spikes. Alternatively, since it is a clone of an existing sizing which may already have been sold to the customer, the user can go to the threshold setting and move it back to 95%, which would then give the same recommendation as the original sizing.

Sizer 5.0

Sizer 5.0 is the latest version of Sizer, going live on 24 Feb 2021.

What's New?

Three major features in Sizer 5.0:

1. Multi recommendation

Sizer now has an option to recommend more than one solution for a given workload, depending on the price range.

2. Sizer Policy

These are the recommended cluster settings based on the deployment environment for the cluster being sized. Sizer strongly recommends going with the default settings for the chosen environment; however, it allows you to make modifications to adjust to a given requirement.

3. Advanced Cluster settings

These are advanced filters to narrow down the sizing to a more specific solution, providing greater flexibility and the ability to accommodate specific customer requests.

 

Sizer journey to 5.0:

From single workload, to multi-workload, to multi-cluster, and finally to multi-recommendation with Sizer 5.0.

 

Multi-era  for Sizer: 

 

  • Multiple Workloads
      • Bulk Edits – Ability to update, delete, disable, enable many workloads at once
      • Enable our next move towards Collector-driven sizing where Collector feeds Sizer with 100s of workloads to create the  most precise sizing

 

  • Multiple Clusters
      • Cluster Settings – Ability for each cluster to have its own settings for common characteristics
        • CPU speed, NIC, Max nodes, thresholds, etc
        • This allows each cluster to be optimized for specific workloads
      • Sizer Policy – Apply best practices defined by experts for different environments
        • Settings for Test/Dev, Production or Mission Critical Environments
  • Multiple Recommendations

  • Cluster Settings gives the user control so that automatic sizing produces the desired recommendation
  • Multiple recommendations then allow the user to play with the results
    • Cost Optimized, Compute Optimized, and Storage Optimized solutions are provided
    • Each can be further tweaked by the user

 

Sizer 5.0 – Multiple Recommendations

 

Toggle between multiple recommendations that fit your cost tolerance

  • Cost Optimized – lowest cost (the default option)
  • Compute Optimized – most cores within the cost tolerance
  • Storage Optimized – most HDD/SSD within the cost tolerance

  • Cost Tolerance
  • This is an advanced setting in the cluster
  • Allows selecting a price delta (from the cheapest solution)
  • Triggers multiple recommendations within the price range (see the sketch below)
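
A minimal sketch of how multiple recommendations can be derived from a cost tolerance (illustrative only – the candidate solutions, field names and 15% tolerance are hypothetical, not Sizer's actual data):

    # Pick cost/compute/storage optimized solutions within a price delta of the cheapest.
    candidates = [
        {"name": "A", "cost": 100_000, "cores": 256, "capacity_tib": 80},
        {"name": "B", "cost": 108_000, "cores": 320, "capacity_tib": 70},
        {"name": "C", "cost": 112_000, "cores": 280, "capacity_tib": 120},
        {"name": "D", "cost": 130_000, "cores": 400, "capacity_tib": 150},
    ]
    tolerance = 0.15   # allow solutions up to 15% costlier than the cheapest

    cheapest = min(c["cost"] for c in candidates)
    in_range = [c for c in candidates if c["cost"] <= cheapest * (1 + tolerance)]

    cost_optimized    = min(in_range, key=lambda c: c["cost"])           # "A"
    compute_optimized = max(in_range, key=lambda c: c["cores"])          # "B"
    storage_optimized = max(in_range, key=lambda c: c["capacity_tib"])   # "C"

Solution D falls outside the price range and is not offered, even though it has the most cores and capacity.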

 

Sizer 5.0 – Policy

Why..

Different deployment environments might have different needs in terms of availability/resiliency and performance

What..

These are the recommended cluster settings based on the deployment environment. They bring consistency to sizings for a given environment.

How.. 

Each cluster has one of the policies below:

  • Test/Dev, Production, Mission Critical
  • Users are allowed to edit the policy settings

Apply your own Sizer Policy for cluster characteristics

  • Maintenance Window requirements
  • Network speed
  • Minimum Compression
  • Other settings

Each cluster can then follow operational policy

  • Test/Dev, Production, Mission Critical
  • You can edit the policy to better meet customer needs

Sizer 5.0 – “Customize” the auto recommendation

  • Tweak the auto recommendation through the “Customize” option
  • Allows incremental increase or decrease of the selected resource
  • Checks for valid/qualified combinations when tweaking (for example, a tweak that violates the SSD/HDD rule or a balanced memory config won't be allowed)
  • Shows the cost delta for the customized solution
  • Stay on the Solutions page while playing with options

Sizer 5.0 – Advanced Cluster settings

Minimum CPU frequency:

  • This ensures the sizing recommends only processors at or above the specified frequency
  • Helpful if the customer is keen on a certain range of processors for performance reasons

CVM/node

  • The values here will override the default CVM overhead applied by Sizer (1)
  • Allows the customer to provision more cores/RAM to the CVM for performance-sensitive workloads (see the sketch below)
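
A minimal sketch of how these advanced settings act on the sizing (illustrative only – the processor list, field names and values are hypothetical, not Sizer's actual data):

    # Minimum CPU frequency filters the processor candidates; the CVM/node values
    # replace Sizer's default per-node CVM reservation.
    processors = [
        {"model": "Silver 4210", "ghz": 2.2},
        {"model": "Gold 5220",  "ghz": 2.2},
        {"model": "Gold 6248",  "ghz": 2.5},
    ]
    settings = {"min_cpu_ghz": 2.4, "cvm_cores_per_node": 8, "cvm_ram_gib": 48}

    eligible = [p for p in processors if p["ghz"] >= settings["min_cpu_ghz"]]
    # Only the Gold 6248 survives the minimum-frequency filter; the sizing then
    # reserves 8 cores / 48 GiB RAM per node for the CVM instead of the defaults.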

Cost Tolerance

  • Allows for a price delta (from the lowest cost)
  • Recommends more than one (default) solution (2)
  • Cost optimized: default – lowest-cost solution
  • Compute optimized: most core-heavy solution in the price range
  • Storage optimized: most capacity-heavy solution in the price range

 

Short demos on Sizer 5.0 features:

Sizer 5.0/5.1 overview:

Multi Recommendation:

Maintenance Window:

Sizer Policy:

SAP HANA sizing guidelines

Sizing for SAP HANA  

Introductory video of SAP HANA in the sizer (non-ST VPN connection required).

https://nutanixinc.sharepoint.com/:v:/t/solperf/solperf_library/ETb67HebgR9Kg32r7jxTO3wBdXoHuOCCjvK2oTasctOgIQ?e=z4Epr2

Note:

  • Only use for SAP HANA based applications, not legacy SAP applications which make use of MSSQL, Oracle etc.
  • Multi-node SAP scale-out is not supported (used for larger SAP BW instances)
  • This sizing process does not vary for scale-up between AHV & VMware implementations
  • No spinning disks are used within a Nutanix cluster being used for SAP HANA
  • Any questions, support, or areas not covered – please use the SAP Slack channel

Supported Platforms:

  • Only Dell, HPE, Fujitsu and Lenovo are supported for SAP HANA, not NX.
  • If another OEM is selected, SAP HANA will not be shown as an available workload

 

Defaults

  • RF2 is used (RF3 is under testing, so not selectable in Sizer)
  • Compression is disabled, and not typically of value for SAP HANA
  • A higher default CVM resource reservation is applied

HANA Inputs

  • NVMe can be added for higher IO loads, such as a high-usage SAP S/4HANA
  • Cost/performance largely drives CPU choice. Ideally, an implementation's potential compute load in SAPS would be known. Please reach out for support in estimating and reviewing such information.

Environments

There would typically be two environments within a Nutanix cluster where production and non-production are mixed. Production rules should be applied to all production instances and to any other instances that should be treated as production. This might apply to a QAS/Test environment and will typically apply to any DR cluster.

Production:

  • For most SAP applications (e.g., production S/4) there is an SAP HANA database and one or more application server instances. Some uses of SAP HANA do not use an application server, in which case just use a small one in the sizing exercise.
  • In addition to the application server instances and the SAP HANA database, a small VM called the ASCS is often called for. This ASCS would be around 2c/24GB RAM/100GB disk.
  • Generally, production has two or more application server instances. Typically 2-6 cores, with around 24GB/core. Use multiple instances for larger loads. There is a small storage space requirement for the OS and application image.
  • For a downtime requirement of less than 20 minutes, a pair of SAP HANA instances should be sized.
  • There is no overcommit of CPU or memory.
  • Servers must have all memory channels filled and balanced, so 6 or 12 DIMMs per CPU – Sizer's auto recommendation enforces this.
  • L-suffix CPUs are required for the largest memory instances.
  • Available storage for SAP HANA should be around 2.5x to 3x memory (3x is used in Sizer) – see the sketch after this list.
  • Production rules – SAP HANA instances run on whole, dedicated CPUs and so cannot be allocated to the CVM's CPU.
  • HANA System Replication (HSR) is an exact copy of the HANA VM. In Sizer, add another HANA VM if implementing HSR.
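
A minimal sketch pulling together some of the production rules above (illustrative only – the instance sizes are hypothetical and this is not Sizer's actual logic):

    # Rough resource tally for one production SAP HANA landscape.
    hana_memory_gib = 1024                      # requested SAP HANA instance memory
    hana_storage_gib = 3 * hana_memory_gib      # Sizer uses 3x memory for HANA storage

    app_servers = 2                             # production usually has two or more
    app_server_cores = 4                        # typically 2-6 cores each
    app_server_ram_gib = app_servers * app_server_cores * 24   # around 24 GB per core

    ascs_vm = {"cores": 2, "ram_gib": 24, "disk_gib": 100}     # small ASCS VM

    dimms_per_cpu = 12
    assert dimms_per_cpu in (6, 12), "all memory channels must be filled and balanced"

If HSR or a sub-20-minute downtime requirement applies, the HANA instance above would simply be counted twice.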

 

Non Production:

 

 

  • QAS/Test landscape tends to match nonPRD for size of instance
  • If an operating system HA cluster is used in production, there is typically at least one such cluster outside of production also – used as a testbed.
  • Each SAP solution would normally have two or three non-production landscapes
  • Solution Manager (SolMan) is often overlooked, and not asked for. It is a required instance in the overall deployment and would be sized in PRD with one SAP HANA instance and an application server instance. Another such pair for QAS/test. No HA clustering would be required.
  • DEV, SBX, etc. are usually smaller subsets in terms of memory size.

 

Feb 8, 2021

First release of SAP HANA in the sizer:

 

Server Virtualization

What is a Server Virtualization sizing? 

This is the most common workload along with VDI. It can be used for any web app which needs to be sized. Each workload or application to be migrated to the Nutanix software stack is a VM with its own CPU/RAM/capacity requirements. To simplify things for users, Sizer has preset VM profiles (small, medium, large), which are customizable to match the actual application needs.

 

What are profiles in Server Virtualization in Sizer?

Profiles are fixed templates with pre-assigned resources – vCPUs, RAM, SSD, HDD – for each profile. Broadly, the small, medium, and large profiles have different allocations of these resources.

The idea is to provide users with the details of a workload (that is, a VM) so they can quickly fill in the number of VMs and Sizer will do the necessary sizing.

Small VM profile template:

Medium VM profile template:

Large VM profile template:

 

What if my VMs are different and have different values?

While these templates and their values are general guidelines, they are customizable.

Clicking on Customize opens a pop-up for user-entered values:
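
A minimal sketch of what a customized profile amounts to for sizing purposes (illustrative only – the resource values are hypothetical, not the actual Sizer templates shown above):

    # A user-entered VM profile and the raw totals the sizing has to account for.
    custom_vm = {"vcpus": 4, "ram_gib": 16, "ssd_gib": 100, "hdd_gib": 400}
    vm_count = 50

    totals = {k: v * vm_count for k, v in custom_vm.items()}
    # {'vcpus': 200, 'ram_gib': 800, 'ssd_gib': 5000, 'hdd_gib': 20000}
    # Sizer then layers CVM overheads, RF, thresholds, etc. on top of these totals.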

Upgrade scenario

What is Upgrade sales option in Sizer?

So far, Sizer has focused on sizing a new cluster for deploying a new workload or migrating an existing workload from a 3-tier architecture to the Nutanix private cloud. However, it did not address the upgrade or expansion use case for an existing Nutanix cluster.

With this option available in Sizer, SEs can recreate an existing Nutanix cluster.

 

How does Upgrade sales option work in Sizer?

Pretty much the same as the standard scenario/use case. The SE marks the scenario as an Upgrade sale by clicking the checkbox below on the scenario page:

Sizer acknowledges that it is an Upgrade use case and will show options accordingly. For workloads, it is no different – SEs can still go ahead and add details about the workloads running on the existing Nutanix cluster; however, they need to know the details of the workloads in order to recreate the cluster.

Additionally, they can run Collector to know all about the workload and import the Collector output to recreate the existing cluster.

 

So how is it different from the standard scenario in Sizer?

The key difference between the Upgrade and standard scenarios is the hardware availability. The Upgrade scenario lists all older models as well; for example, for NX it will list G5 and G4 platforms.

While auto sizing recommends the current HW platforms (like G7 for NX) in both cases, the manual option includes the older G4/G5 NX platforms so SEs can recreate the existing cluster with the original NX model – which most likely is an older platform that is currently end of sale.

Please see the image below for reference:

In what other ways is this feature helpful, apart from upgrades?

This feature is also helpful for expansion. Users can recreate the existing cluster with G4/G5 models and add another node for expansion; that node could be a G7. A cluster with a mix of existing G4 nodes and new G7 nodes can be created in Sizer with this option.

 

Creating a Frontline quote (including a budgetary quote) for Upgrade scenarios in Sizer

Sizer also allows you to generate a Frontline quote or budgetary quote for an upgrade scenario. Since the Upgrade scenario deals with an existing Nutanix cluster deployed at the customer site, users have an option to mark nodes that are already part of the solution as “Existing”. Sizer won't generate the hardware quote and licenses for those existing nodes, as they have already been purchased by the customer. If the existing license is a Life of Device license and the customer plans to move to a CBL license model, an additional software license needs to be added manually.

The BOM generated by Sizer contains the overall solution with both the existing and new nodes, along with the license quantities required for the complete solution.

ECX savings

What is ECX [Erasure Coding] in Nutanix ?

The Nutanix platform leverages a replication factor (RF) for data protection  and availability.  This method provides the highest degree of availability because it does not require reading from more than one storage location or data re-computation on failure.  However, this does come at the cost of storage resources as full copies are required.

To provide a balance between availability while reducing the amount of storage required, DSF provides the ability to encode data using erasure codes (EC).  Similar to the concept of RAID (levels 4, 5, 6, etc.) where parity is calculated, EC encodes a strip of data blocks on different nodes and calculates parity.  In the event of a host and/or disk failure, the parity can be leveraged to calculate any missing data blocks (decoding).

The number of data and parity blocks in a strip is configurable based upon the desired failures to tolerate.  The configuration is commonly referred to as the number of <data blocks>/<number of parity blocks>.

How is ECX savings calculated in Sizer ?

Sizer follows the Nutanix Bible and its guidelines for ECX savings.

The table below shows the ECX overhead vs RF2/RF3 for different node counts:

The expected overhead can be calculated as <# parity blocks> / <# data blocks>.  For example, a 4/1 strip has a 25% overhead or 1.25X compared to the 2X of RF2.  A 4/2 strip has a 50% overhead or 1.5X compared to the 3X of RF3.

 

How does Sizer calculate ECX savings from the above?

Let's take an example where the cold data for the workload is 100 TiB.

Also, we will use RF2 as the setting chosen for the workload.

Depending on the size of the workload, if the recommendation came to, let's say, 4 nodes, then as per the above table the data/parity strip is 2/1. That is a 1.5x overhead for ECX as against 2x for RF2 – thus 50% savings.

To be conservative and on the safe side, we only consider ECX for 90% of the cold data.

ECX applied to 90% of 100 TiB = 90 TiB

ECX savings: 50% of 90 TiB = 45 TiB
How much ECX savings: 50% = 50% of 90TiB = 45TiB