This article explains the logic behind storage calculations for clusters whose nodes have different storage capacities.
What has changed?
Previously, capacity calculations were based on the aggregate capacity across all nodes in the cluster. This total was the basis for calculating the cluster's usable and effective capacity.
For example, consider 3 nodes: N1 = 20TB, N2 = 20TB, and N3 = 10TB.
Based on the above, the total capacity available is 20 + 20 + 10 = 50TB. Assuming N+1 failover (reserving the largest node), 30TB remains across N2 + N3. Thus, 15TB can be used for data and 15TB for its RF copy (assuming RF2).
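As a quick illustration, the old aggregate-based arithmetic for this example can be sketched as follows (illustrative Python, not Sizer's actual code):

```python
nodes = [20, 20, 10]                 # N1, N2, N3 capacities in TB
total = sum(nodes)                   # 50 TB aggregate
after_n_plus_1 = total - max(nodes)  # reserve the largest node: 30 TB
data = after_n_plus_1 / 2            # RF2: 15 TB data + 15 TB RF copies
```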
With the new update, Sizer also ensures that the RF copy of the data and the data itself do not share the same node.
In the above example, two nodes are available after N+1: N2 = 20TB and N3 = 10TB.
If we allowed 15TB of data (and 15TB of RF copies) to be written, part of the data and its RF copy would have to land on the same node, because N3 is only 10TB. To keep every RF copy on a separate node from its data, the usable storage in this case is 20TB: 10TB of data on N2 and its RF copy on N3, or vice versa.
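The placement-aware result for the same example can be sketched like this (again illustrative; the general algorithm is given below):

```python
remaining = [20, 10]                            # N2 and N3 after N+1, in TB
# With two nodes, data is capped by the smaller node, since each byte's
# RF copy must land on the other node.
data = min(sum(remaining) / 2, min(remaining))  # 10 TB of data
usable = 2 * data                               # 20 TB raw consumed (data + RF)
```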
Note: Although the same logic is used for both homogeneous and mixed-capacity clusters, the difference is primarily visible for mixed-capacity clusters.
Here is a detailed write-up on how usable storage is calculated for clusters with mixed-capacity nodes across different scenarios for RF2 and RF3.
Algorithm for RF2
If we have only one node with non-zero capacity Cx, then under RF2 the replication is done between different disks of the same node, and the extent store in this case is Cx / 2 (RF). Otherwise, one of the cases below applies. Say we have nodes with capacities C1, C2, C3, …, C[n], sorted in increasing order of capacity. There are 2 cases to consider for RF2 when computing the effective raw storage capacity:
Case-1: C1 + C2 + C3 + … + C[n-1] <= C[n]
In this case, the total amount of data that can be replicated with a factor of 2 is C1 + C2 + C3 + … + C[n-1], since every byte written to the smaller nodes can have its replica placed on C[n].
Case-2: C1 + C2 + C3 + … + C[n-1] > C[n]
In this case, (total storage capacity) / 2 can be replicated among the available nodes. In other words, half of the total capacity can hold data and the other half its replicas.
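Putting the two cases together, here is a minimal sketch of the RF2 computation (illustrative Python, not Sizer's actual code; capacities may be in any consistent unit):

```python
def rf2_replicable_capacity(capacities):
    """Amount of data that can be stored with one replica, keeping
    data and replica on separate nodes (sketch of the cases above)."""
    caps = sorted(c for c in capacities if c > 0)
    if not caps:
        return 0.0
    if len(caps) == 1:
        # Single node: replicate between disks of the same node,
        # so the extent store is Cx / 2.
        return caps[0] / 2
    largest, rest = caps[-1], caps[:-1]
    if sum(rest) <= largest:
        # Case-1: everything on the smaller nodes can replicate onto C[n].
        return float(sum(rest))
    # Case-2: half of the total capacity can be replicated.
    return sum(caps) / 2

# The worked example from the beginning of the article (after N+1):
print(rf2_replicable_capacity([20, 10]))  # 10.0 -> 10TB data + 10TB RF
```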
Algorithm for RF3
Say we have nodes with capacities C1, C2, C3, …, C[n], sorted in increasing order of capacity. The algorithm for RF3 is slightly different from that of RF2, because the replicas of the data must be accommodated on 2 nodes, as opposed to a single node for RF2.
- Since there are 3 replicas to place, we calculate the capacity difference between the 2nd largest (C[n-1]) and 3rd largest (C[n-2]) entities as 'diff'. In the optimal placement scenario, the first replica is placed on the entity with the smallest capacity, the second on the entity with the largest capacity (C[n]), and the third on the entity with the 2nd largest capacity (C[n-1]). The difference C[n-1] – C[n-2] lets us quickly deduce when the 2nd largest entity will become equal to the 3rd largest, by virtue of the space consumed on the former through replica placement.
- By deducting whichever is smaller of the 'diff' calculated above and the capacity of the smallest entity, and simulating RF3 placement until C[n-2] and C[n-1] become equal (note that the difference between C[n] and C[n-1] remains constant during this step, since the same capacity is deducted from both), we arrive in O(N) at one of the following cases:
- Case-1: Only 3 entities remain with non-zero capacities. The amount of data that can be accommodated among these 3 nodes with an RF of 3 (the data plus 2 replicas) is the smallest remaining capacity, C[n-2].
- Case-2: There is capacity left on C[n-3] (the 4th largest entity) and possibly on nodes before it (C[n-4], C[n-5], …), and C[n-2] == C[n-1] (the capacities remaining on the 3rd and 2nd largest entities have become equal). This happens when the smallest non-zero capacity remaining before C[n-2] is greater than C[n-1] – C[n-2]: after placing the first replica on the smallest entity, the second on C[n], and the third on C[n-1], the point is reached where C[n-1] == C[n-2]. From here, for the next bytes of data, the second replica still goes to C[n], while the third replica is round-robined between at least 2 (or more) entities. In this scenario, 2 sub-cases can arise:
- Case-2(a): (C1 + C2 + … + C[n-1]) / 2 <= C[n]
If C[n]'s capacity is so high that, for every pair of 1st and 3rd replicas placed on the lower-capacity nodes up to C[n-1], the second replica always finds space on C[n], i.e., if (C1 + C2 + … + C[n-1]) / 2 <= C[n], then the amount of storage that can be accommodated with an RF of 3 is the lower of the two sides of that inequality, i.e., (C1 + C2 + … + C[n-1]) / 2, as we cannot consume the full space on C[n].
- Case-2(b): (C1 + C2 + … + C[n-1]) / 2 > C[n]
But if C[n]'s capacity is not as high as in case (a), i.e., (C1 + C2 + … + C[n-1]) / 2 > C[n], then one of the replicas is placed on the largest entity C[n], while the other two round-robin among the other largest-capacity entities (the capacities remaining on at least the 2 entities C[n-2] and C[n-1] are already equal). This continues until C[n] becomes equal to C[n-1], which is guaranteed to happen eventually, because replicas consume space on C[n] at at least double the rate of C[n-1], C[n-2], …. From that point, both the second and the third replicas round-robin across all remaining entities, so all the capacity remaining at that point can be fully consumed. Hence, in this case, the amount of storage that can be accommodated is the sum of all remaining (non-zero) capacities divided by 3 (the RF). A sketch of the full computation follows below.
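Putting the RF3 cases together, here is a hedged sketch of the computation (illustrative Python, not Sizer's actual implementation; capacities may be in any consistent unit):

```python
def rf3_replicable_capacity(capacities):
    """Amount of data that can be stored with 2 extra replicas, all three
    copies on distinct nodes (a sketch of the case analysis above)."""
    caps = sorted(c for c in capacities if c > 0)
    if len(caps) < 3:
        return 0.0  # fewer than 3 non-zero nodes: not covered by the write-up
    data = 0.0
    # Simulate optimal placement: first replica on the smallest entity,
    # second on C[n], third on C[n-1], until either only 3 entities remain
    # or C[n-1] has shrunk to equal C[n-2].
    while len(caps) > 3 and caps[-2] > caps[-3]:
        step = min(caps[0], caps[-2] - caps[-3])  # 'diff' or smallest entity
        data += step
        caps[0] -= step
        caps[-1] -= step
        caps[-2] -= step
        caps = sorted(c for c in caps if c > 0)
    if len(caps) == 3:
        # Case-1: bounded by the smallest of the 3 remaining capacities.
        return data + caps[0]
    rest, largest = caps[:-1], caps[-1]
    if sum(rest) / 2 <= largest:
        # Case-2(a): the second replica always finds room on C[n].
        return data + sum(rest) / 2
    # Case-2(b): everything remaining can be consumed, divided by RF 3.
    return data + sum(caps) / 3

print(rf3_replicable_capacity([10, 20, 20]))  # 10.0
```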
Terminologies
Effective Usable Capacity = 95% of (Raw capacity – failover capacity based on RF)
95% because AOS stops writing to the disk when the cluster utilization reaches 95%.
Effective Capacity = Effective Usable Capacity – CVM
Extent Store Capacity = Effective Capacity / RF
Effective Capacity without Savings = Extent Store Capacity
Effective Capacity with Savings = Extent Store Capacity + Savings (Storage Efficiency & Erasure Coding)
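As a quick illustration of this chain, with hypothetical numbers for the CVM overhead and savings (none of these values come from the article):

```python
raw = 50.0        # TB, aggregate raw capacity
failover = 20.0   # TB, capacity reserved for failover (e.g. the largest node)
cvm = 2.0         # TB, hypothetical CVM overhead
rf = 2
savings = 3.0     # TB, hypothetical storage-efficiency / erasure-coding gains

effective_usable = 0.95 * (raw - failover)       # 28.5 TB
effective = effective_usable - cvm               # 26.5 TB
extent_store = effective / rf                    # 13.25 TB
effective_without_savings = extent_store         # 13.25 TB
effective_with_savings = extent_store + savings  # 16.25 TB
```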