Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version: 1.30.0
What k8s version are you using (kubectl version)?:
Rancher management cluster: v1.30.6+rke2r1 (RKE2)
Downstream cluster: v1.30.6+rke2r1
What environment is this in?:
Self-hosted, Rancher-provisioned RKE2 clusters on VMware vSphere
What did you expect to happen?:
CAS should remove the correct number of nodes from the cluster during scale-down.
What happened instead?:
The CAS instance running in a downstream RKE2 cluster tried to delete a single node due to low utilization. The deletion process got stuck in Rancher (for reasons currently unknown to us), which resulted in CAS gradually reducing the corresponding node group size to its configured minimum. In our specific case this meant that a node group with 25 nodes was suddenly reduced to a size of 1, evicting all nodes and rendering all workloads unavailable instantly.
Analysis:
We did a root cause analysis of the issue, since in our case a production environment with customer workloads was affected. During this analysis we discovered what we consider a critical bug in the scale-down process of the Rancher implementation of CAS. Looking at the logs of the CAS pod right before the outage, we found the following:
The redacted vSphere ID is always the same and corresponds to the vSphere ID of the node that was being scaled down. The logs of course contained a lot of other output, which we removed here for brevity; in our opinion these lines are enough to understand what happened.
As is visible from the log, CAS went through its main loop several times over a period of roughly 4 minutes, each time discovering the same underutilized node and scheduling it for deletion in the Rancher management cluster. During this period Rancher was unable to reconcile the deletion for reasons unknown to us, possibly related to high load on the Rancher manager. Whatever the case, each iteration of the loop calls the following code, which also produces the log output seen above:
autoscaler/cluster-autoscaler/cloudprovider/rancher/rancher_nodegroup.go, line 109 (commit 55a18a3)
which in turn calls the markMachineForDeletion function:
autoscaler/cluster-autoscaler/cloudprovider/rancher/rancher_nodegroup.go, line 378 (commit 55a18a3)
This function sets an annotation on the machines.cluster.x-k8s.io object in the Rancher management cluster corresponding to the machine that is scheduled for deletion. Afterwards, in each iteration, the node group size is reduced by one via the setSize function:
autoscaler/cluster-autoscaler/cloudprovider/rancher/rancher_nodegroup.go, line 251 (commit 55a18a3)
The setSize function operates on the clusters.provisioning.cattle.io object in the management cluster, which contains a list of machine pools, each with a quantity field governing the number of machines in that pool.
autoscaler/cluster-autoscaler/cloudprovider/rancher/rancher_nodegroup.go, line 260 (commit 55a18a3)
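For readers unfamiliar with that object, here is a minimal sketch of what such a quantity decrement amounts to when done against clusters.provisioning.cattle.io via a dynamic client. This is illustrative only: it assumes the spec.rkeConfig.machinePools[].quantity layout of the Rancher v2 provisioning API and is not the provider's actual setSize code.

```go
// Illustrative sketch only; not the cluster-autoscaler code. It assumes the
// Rancher v2 provisioning layout spec.rkeConfig.machinePools[].quantity.
package example

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var provClusterGVR = schema.GroupVersionResource{
	Group: "provisioning.cattle.io", Version: "v1", Resource: "clusters",
}

// decrementPoolQuantity lowers the quantity of one named machine pool by one,
// which is effectively what every scale-down iteration did in our case.
func decrementPoolQuantity(ctx context.Context, dyn dynamic.Interface, namespace, clusterName, poolName string) error {
	cluster, err := dyn.Resource(provClusterGVR).Namespace(namespace).Get(ctx, clusterName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	pools, found, err := unstructured.NestedSlice(cluster.Object, "spec", "rkeConfig", "machinePools")
	if err != nil {
		return err
	}
	if !found {
		return fmt.Errorf("spec.rkeConfig.machinePools not found on cluster %s/%s", namespace, clusterName)
	}

	for i, p := range pools {
		pool, ok := p.(map[string]interface{})
		if !ok || pool["name"] != poolName {
			continue
		}
		quantity, _, _ := unstructured.NestedInt64(pool, "quantity")
		if err := unstructured.SetNestedField(pool, quantity-1, "quantity"); err != nil {
			return err
		}
		pools[i] = pool
	}

	if err := unstructured.SetNestedSlice(cluster.Object, pools, "spec", "rkeConfig", "machinePools"); err != nil {
		return err
	}
	_, err = dyn.Resource(provClusterGVR).Namespace(namespace).Update(ctx, cluster, metav1.UpdateOptions{})
	return err
}
```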
In our case, since reconciliation of the clusters.provisioning.cattle.io object was stuck on the Rancher side for several minutes, the machine was not deleted and CAS gradually reduced the size of the machine pool until it reached its minimum size, at which point the check
autoscaler/cluster-autoscaler/cloudprovider/rancher/rancher_nodegroup.go, line 110 (commit 55a18a3)
stopped it.
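For a sense of scale, here is a tiny self-contained simulation of that decrement loop using the numbers from this incident (25 nodes, a minimum size of 1, and the roughly 10s CAS loop interval); it shows why it took about 4 minutes to walk the pool down to its minimum.

```go
// Simulates the failure mode: while the deletion is stuck in Rancher, every
// CAS main loop re-marks the same machine and lowers the target size by one,
// until the minimum-size check finally stops it.
package main

import "fmt"

func main() {
	const (
		startSize    = 25 // nodes in the affected pool
		minSize      = 1  // configured node group minimum
		loopInterval = 10 // seconds between CAS main loops
	)

	size, elapsed := startSize, 0
	for size-1 >= minSize {
		size-- // one more setSize(size-1) call against the machine pool
		elapsed += loopInterval
	}

	// Prints: target size reduced to 1 after ~240s (~4 minutes), which matches
	// the window observed in the logs.
	fmt.Printf("target size reduced to %d after ~%ds (~%d minutes)\n", size, elapsed, elapsed/60)
}
```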
At some point, Rancher was able to reconcile its clusters.provisioning.cattle.io resource, which by then had been modified to the point where the quantity parameter was 1, and Rancher deleted all nodes from the machine pool.
In our opinion this is a critical oversight in the implementation of the scale-down process: even in a less extreme scenario, where Rancher merely fails to delete the node within the time frame between two CAS loops (which run every 10s), CAS would delete more nodes than required. The interface definition of NodeGroup clearly states that DeleteNodes should wait for the node to be deleted.
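For reference, the NodeGroup contract in cluster-autoscaler's cloudprovider/cloud_provider.go documents this expectation roughly as follows (paraphrased from memory; see the source for the exact wording):

```go
// Paraphrased excerpt of the cloudprovider.NodeGroup interface
// (apiv1 refers to k8s.io/api/core/v1).
type NodeGroup interface {
	// DeleteNodes deletes nodes from this node group.
	// This function should wait until the node group size is updated.
	// Implementation required.
	DeleteNodes([]*apiv1.Node) error

	// ... remaining methods omitted
}
```

In other words, an implementation is expected to wait for the resize to actually take effect rather than merely request it and return.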
We are aware that setting the minimum size of the node pool to a value closer to the actual number of nodes would have mitigated the impact somewhat (and as a temporary workaround we have adapted our node groups accordingly); however, this should not be required, and instead the DeleteNodes function should be made more idempotent in order to prevent such outcomes.
Suggestion:
One possible fix would be to add a check in the markMachineForDeletion function: if the deletion annotation is already set on a machine object, simply update the timestamp and return a custom error, which the caller could then use to skip the setSize call. If this suggestion is acceptable, we would also be ready to provide a corresponding PR.
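A rough sketch of that idea follows. The sentinel error, the annotation key, and the GVR are placeholders and may not match the provider's actual identifiers; the point is only to show the guard and how the caller could react to it.

```go
// Sketch of the proposed idempotency guard; identifiers are placeholders.
package example

import (
	"context"
	"errors"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// ErrMachineAlreadyMarked tells DeleteNodes that a previous iteration already
// accounted for this machine, so the subsequent setSize call must be skipped.
var ErrMachineAlreadyMarked = errors.New("machine already marked for deletion")

// deleteAnnotation is a placeholder key; the provider uses its own annotation.
const deleteAnnotation = "example.cattle.io/marked-for-deletion"

var machineGVR = schema.GroupVersionResource{
	Group: "cluster.x-k8s.io", Version: "v1beta1", Resource: "machines",
}

func markMachineForDeletion(ctx context.Context, dyn dynamic.Interface, machine *unstructured.Unstructured) error {
	annotations := machine.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	_, alreadyMarked := annotations[deleteAnnotation]

	// Refresh the timestamp either way so the deletion request stays visible.
	annotations[deleteAnnotation] = time.Now().UTC().Format(time.RFC3339)
	machine.SetAnnotations(annotations)

	if _, err := dyn.Resource(machineGVR).Namespace(machine.GetNamespace()).
		Update(ctx, machine, metav1.UpdateOptions{}); err != nil {
		return err
	}

	if alreadyMarked {
		// Signal the caller that the target size was already decremented for
		// this machine in an earlier loop iteration.
		return ErrMachineAlreadyMarked
	}
	return nil
}
```

DeleteNodes could then check errors.Is(err, ErrMachineAlreadyMarked) and skip the setSize call for machines that were already marked, so a stuck Rancher reconciliation would cost at most one decrement per node instead of one per loop iteration.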
How to reproduce it (as minimally and precisely as possible):
Reproducing this error is tricky, since it would require somehow artificially keeping Rancher from reconciling its clusters.provisioning.cattle.io resource. Nevertheless, from a logical standpoint the code clearly shows that this can happen.
Anything else we need to know?: