Storage calculation for clusters with mixed capacity nodes

This article explains how storage is calculated for clusters whose nodes have different storage capacities.

What has changed?

Previously, capacity calculations were based on the aggregate capacity across nodes in the cluster. This total capacity was the basis for calculating usable and effective capacity in the cluster.

For example, consider 3 nodes: N1 = 20TB, N2 = 20TB, and N3 = 10TB.

Based on the above, the total capacity available is 20 + 20 + 10 = 50TB. Assuming N+1, one node (N1) is set aside for failover and the available nodes are N2 + N3 = 30TB. Thus, 15TB can be used for data and 15TB for its RF copy (assuming RF2).

With the new update, Sizer also ensures that the data and its RF copy do not share the same node.

In the above example, after N+1, two nodes are available: N2 = 20TB and N3 = 10TB.

If we allowed 15TB of data to be written (and 15TB for RF), part of the data and its RF copy would have to sit on the same node, because N3 is only 10TB. So, to ensure the RF copy and the data are on separate nodes, the usable storage in this case would be 20TB (10TB of data on N2 and its RF copy on N3, or vice versa).

Note: Although the same logic is used for both homogeneous and mixed capacity clusters, the difference is seen primarily for the mixed capacity clusters.

Here is a detailed write-up on how the usable storage is calculated for clusters with mixed capacity nodes, for different scenarios across RF2 and RF3.

Algorithm for RF2

If only one node has non-zero capacity Cx, then in RF2 the replication is done between the different disks of that node, and hence the extent store in this case is Cx / 2 (RF). Otherwise, one of the cases below applies.

Let us say we have nodes with capacities C1, C2, C3, …, C[n], sorted in ascending order of capacity, so that C[n] is the largest. There are 2 cases to consider for RF2 to compute the effective raw storage capacity:

Case-1: C1 + C2 + C3 + … + C[n-1] <= C[n]
If this is the case, then the total amount of storage that can be replicated with a factor of 2 is C1 + C2 + C3 + … + C[n-1], because every piece of data on the smaller nodes can keep its second copy on C[n].

Case-2: C1 + C2 + C3 + …. + C[n-1] > C[n]
If this is the case, then (total storage capacity) / 2 (RF) can be replicated among the available nodes. In other words, half the total capacity can be replicated.
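
The RF2 rules above can be sketched in a few lines of Python. This is an illustration of the cases as described in this article, not Sizer's actual code; the function name and the plain list of per-node capacities are assumptions made for the example. The returned value is the amount of data that can be stored with two copies, so the raw space consumed is twice that figure.

```python
def rf2_replicable_capacity(capacities):
    """Data that fits with two copies on distinct nodes (hypothetical helper)."""
    # Ignore empty nodes and sort ascending so the last entry is C[n].
    nodes = sorted(c for c in capacities if c > 0)
    if not nodes:
        return 0.0
    if len(nodes) == 1:
        # Single node: the copy goes to different disks of the same node.
        return nodes[0] / 2
    largest = nodes[-1]
    others = sum(nodes[:-1])
    if others <= largest:
        # Case-1: the largest node can hold a copy of everything else.
        return float(others)
    # Case-2: half of the total capacity can be replicated.
    return (others + largest) / 2


# The example from the start of the article, after N+1: N2 = 20TB, N3 = 10TB.
print(rf2_replicable_capacity([20, 10]))      # 10.0
# Three nodes where the smaller ones together exceed the largest.
print(rf2_replicable_capacity([20, 20, 10]))  # 25.0
```

For the 20TB + 10TB pair, the result of 10TB of data (20TB of raw space including the RF copy) matches the usable storage worked out earlier.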

Algorithm for RF3

Let us say we have nodes with capacities C1, C2, C3, …, C[n], sorted in ascending order of capacity. The algorithm for RF3 is slightly different from that of RF2 because we need to accommodate the replicas of the data on 2 additional nodes, as opposed to a single additional node in RF2.

  1. Since there are 3 replicas to place, we calculate the capacity difference between the 2nd largest (C[n-1]) and the 3rd largest (C[n-2]) entities as ‘diff’. In an optimal placement, the first replica is placed on the entity with the smallest capacity, the second replica on the entity with the largest capacity (C[n]), and the third replica on the entity with the 2nd largest capacity (C[n-1]). The difference C[n-1] – C[n-2] then lets us quickly deduce when the remaining capacity on the 2nd largest entity will drop to that of the 3rd largest entity as replicas consume space on the former.
  2. By deducting either the ‘diff’ calculated above or the capacity of the smallest entity, and simulating RF3 placement until C[n-2] and C[n-1] have become equal (the difference between C[n] and C[n-1] remains constant throughout, since the same capacity is deducted from both of them), we arrive in O(N) at one of the following cases:
    • Case-1: Only 3 entities remain with non-zero capacities. In this case, the amount of data that can be accommodated among these 3 nodes with an RF of 3 (one original copy and 2 replicas) is the smallest remaining capacity, which is C[n-2].
    • Case-2: There is capacity left in C[n-3] (i.e., the 4th largest entity) and possibly in any number of nodes before it (C[n-4], C[n-5], etc.), and C[n-2] == C[n-1] (i.e., the capacities remaining on the 3rd and the 2nd largest entities have become equal). This happens when the capacity of the smallest remaining non-zero entity below C[n-2] is greater than C[n-1] – C[n-2], so that after placing one replica on C[n] and another on C[n-1], the capacity on C[n-1] has dropped to that of C[n-2]. From this point, for the next bytes of data, the second replica will go to C[n] while the third replica will be round-robined between at least 2 entities. In this scenario as well, 2 cases can arise:
      • Case-2(a): (C1 + C2 + … + C[n-1]) / 2 <= C[n]
        If C[n]’s capacity is high enough that, for every first and third replica placed on the lower-capacity nodes up to C[n-1], the second replica always finds space on C[n], i.e., (C1 + C2 + … + C[n-1]) / 2 <= C[n], then the amount of storage that can be accommodated on the available nodes with an RF of 3 is the smaller side of that inequality, (C1 + C2 + … + C[n-1]) / 2, because the full space on C[n] cannot be consumed.
      • Case-2(b): (C1 + C2 + … + C[n-1]) / 2 > C[n]
        If C[n]’s capacity is not as high as in case (a), i.e., (C1 + C2 + … + C[n-1]) / 2 > C[n], then one of the replicas will be placed on the largest entity C[n], while the other two replicas will round-robin among the next largest entities (since the capacities remaining on at least 2 entities, C[n-2] and C[n-1], are already equal). This continues until C[n] becomes equal to C[n-1], which is guaranteed to happen eventually because the replicas consume space on C[n] at least twice as fast as on C[n-1], C[n-2], and so on. From that point, both the second and the third replicas will be round-robined across all the remaining entities, and thus all the capacity remaining at that point can be fully consumed. Hence, in this case, the amount of storage that can be accommodated is the sum of all remaining (non-zero) entities divided by 3 (RF).
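
As a cross-check on the RF3 cases above, the sketch below computes the same kind of result in closed form. It is not the O(N) deduction procedure described in the steps, and it is not taken from Sizer. It assumes that the only constraint is that the three copies of any piece of data sit on three different nodes. Under that assumption, the data size cannot exceed (capacity outside the j largest nodes) / (3 – j) for j = 0, 1, 2, and the smallest of those three bounds is achievable. The function name is hypothetical, and returning 0 for fewer than three non-zero nodes is an assumption, since the article does not cover that situation for RF3.

```python
def rf3_replicable_capacity(capacities):
    """Data that fits with three copies on distinct nodes (hypothetical helper)."""
    # Sort descending so nodes[0] is C[n] and nodes[1] is C[n-1].
    nodes = sorted((c for c in capacities if c > 0), reverse=True)
    if len(nodes) < 3:
        # Assumption: RF3 needs at least three nodes with capacity.
        return 0.0
    total = sum(nodes)
    bounds = []
    removed = 0
    for j in range(3):
        # Capacity outside the j largest nodes must hold (3 - j) copies.
        bounds.append((total - removed) / (3 - j))
        removed += nodes[j]
    return min(bounds)


print(rf3_replicable_capacity([10, 20, 20]))      # 10.0 (only 3 nodes: the smallest)
print(rf3_replicable_capacity([10, 10, 10, 40]))  # 15.0 (others / 2 <= C[n])
print(rf3_replicable_capacity([10, 10, 10, 12]))  # 14.0 (others / 2 > C[n]: total / 3)
```

The three sample inputs correspond to Case-1, Case-2(a), and Case-2(b) respectively, and the results agree with the formulas given for those cases.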

Terminologies

We will explain the terminologies with an example. To keep it simple, let us take a homogeneous cluster with 6 identical nodes.

Effective Raw Capacity is the Total Raw Storage available across clusters within the scenario. In this case, it is 6 nodes, each with 6 x 1.92TB drives, amounting to 69.12 TB or 62.86 TiB, as shown in the first row in Figure 2 below.

Failover Capacity Overhead is the storage that is discarded according to the failover plan selected. In the above case, the failover plan is set to “Standard (N+1),” and hence we arrive at this value by discarding all the storage available within one node. The solution in the above example uses all-flash nodes with 6 x 1.92TB drives per node, so the Failover Capacity Overhead amounts to one node's capacity: 11.52TB or 10.48 TiB. The same can be seen in the second row in Figure 2.

Effective Raw Capacity After Failover = Effective Raw Capacity – Failover Capacity Overhead

In the above example, that is 62.86 TiB – 10.48 TiB = ~52.39 TiB
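
The TB and TiB figures above can be reproduced with a short calculation. The sketch assumes decimal terabytes (10^12 bytes), binary tebibytes (2^40 bytes), and the homogeneous 6-node example, where N+1 discards exactly one node's worth of raw storage. The variable and function names are illustrative only.

```python
TB = 10**12   # bytes in a decimal terabyte
TIB = 2**40   # bytes in a binary tebibyte


def tb_to_tib(tb):
    """Convert decimal terabytes to binary tebibytes."""
    return tb * TB / TIB


nodes, drives_per_node, drive_tb = 6, 6, 1.92

node_raw_tb = drives_per_node * drive_tb           # 11.52 TB per node
raw_tib = tb_to_tib(nodes * node_raw_tb)           # Effective Raw Capacity
failover_tib = tb_to_tib(node_raw_tb)              # one node discarded for N+1
after_failover_tib = raw_tib - failover_tib        # Effective Raw Capacity After Failover

print(round(raw_tib, 2), round(failover_tib, 2), round(after_failover_tib, 2))
# 62.86 10.48 52.39
```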

Effective Usable Capacity After Failover = ~95% of Effective Raw Capacity After Failover as AOS stops writing to the disk when the cluster utilization reaches 95%.

In the above example, that would be 52.39 * ~0.95 = 50.22 TiB

Extent Store = (Effective Usable Capacity After Failover – CVM) / Replication Factor

In the above example, with RF set to 2, it would be (50.22 TiB – 9.14 TiB)/2 = 20.54 TiB

Effective Capacity = Extent Store + Savings (Storage Efficiency & Erasure Coding)

In the above example, without savings, the Effective Capacity is 20.54 TiB.

If we set the Storage Efficiency to 18%, the Savings amount to 4.52 TiB, and hence the Effective Capacity would be 20.54 TiB + 4.52 TiB = 25.06 TiB. The same can be seen in the bottom part of Figure 2.
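
The last two formulas can be chained as shown below. The function names are hypothetical, and the inputs (50.22 TiB usable after failover, 9.14 TiB for the CVM, 4.52 TiB of savings) are simply the figures quoted in the example; the sketch does not attempt to derive the savings value from the Storage Efficiency percentage.

```python
def extent_store(usable_after_failover_tib, cvm_tib, replication_factor):
    """Extent Store = (Effective Usable Capacity After Failover - CVM) / RF."""
    return (usable_after_failover_tib - cvm_tib) / replication_factor


def effective_capacity(extent_store_tib, savings_tib=0.0):
    """Effective Capacity = Extent Store + Savings."""
    return extent_store_tib + savings_tib


store = extent_store(50.22, 9.14, 2)
print(round(store, 2))                                      # 20.54
print(round(effective_capacity(round(store, 2), 4.52), 2))  # 25.06
```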

Please note that the numbers will vary for heterogeneous clusters, where nodes of different capacities exist within a cluster. The larger the variance of capacity across nodes, the more the Extent Store, and hence the Effective Capacity, is reduced to ensure that enough storage is present across nodes to maintain copies of data as per the desired Replication Factor.