[CAS] Critical bug in cloudprovider/rancher implementation of node scale down #7474

pvlkov opened this issue Nov 7, 2024 · 2 comments

pvlkov commented Nov 7, 2024

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: 1.30.0

What k8s version are you using (kubectl version)?:

Rancher management cluster: v1.30.6+rke2r1 (RKE2)
Downstream cluster: v1.30.6+rke2r1

What environment is this in?:

Self-hosted, Rancher-provisioned RKE2 clusters on VMware vSphere

What did you expect to happen?:

CAS should remove the correct number of nodes from the cluster during scale-down.

What happened instead?:

The CAS instance running in a downstream RKE2 cluster tried to delete a single node due to low utilization. The deletion process got stuck in Rancher (for reasons currently unknown to us), and CAS ended up reducing the corresponding node group size to its minimum value. In our specific case this meant that a node group with 25 nodes was suddenly reduced to a size of 1, deleting all but one of its nodes and rendering the affected workloads unavailable instantly.

Analysis:
We performed a root cause analysis, since in our case a production environment with customer workloads was affected. During this analysis we discovered what we consider a critical bug in the scale-down process of the Rancher implementation of CAS. Looking at the logs of the CAS pod right before the outage, we found the following:

2024-11-06 07:30:56.523	I1106 06:30:56.523011       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:31:06.933	I1106 06:31:06.933853       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:31:17.388	I1106 06:31:17.388816       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:31:27.851	I1106 06:31:27.851784       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:31:38.474	I1106 06:31:38.474876       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:31:48.889	I1106 06:31:48.889420       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:31:59.360	I1106 06:31:59.360176       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:32:09.967	I1106 06:32:09.967698       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:32:20.457	I1106 06:32:20.457806       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:32:31.068	I1106 06:32:31.068193       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:32:41.542	I1106 06:32:41.542084       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:32:52.057	I1106 06:32:52.057269       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:33:02.541	I1106 06:33:02.540962       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:33:12.997	I1106 06:33:12.996980       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:33:23.419	I1106 06:33:23.419709       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:33:33.858	I1106 06:33:33.858363       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:33:44.439	I1106 06:33:44.439246       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:33:54.848	I1106 06:33:54.848388       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:34:05.303	I1106 06:34:05.303319       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:34:15.889	I1106 06:34:15.889135       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:34:26.363	I1106 06:34:26.362880       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:34:37.018	I1106 06:34:37.018566       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted
2024-11-06 07:34:47.533	I1106 06:34:47.533476       1 rancher_nodegroup.go:121] marking machine for deletion: vsphere://id-redacted

The redacted vSphere ID is always the same and corresponds to the vSphere ID of the node that was being scaled down. The logs of course contained a lot of other output, which has been removed here for brevity and because these lines are, in our opinion, sufficient to understand what happened.

As is visible from the log, CAS went through its main loop several times during a period of roughly 4 minutes, each time discovering an underutilized node and scheduling it for deletion in the Rancher management cluster. During this period Rancher was unable to reconcile the deletion for reasons unknown to us, possibly related to high load on the Rancher manager. Whatever the case, each iteration of the loop calls the following code, which also produces the log output seen above:

func (ng *nodeGroup) DeleteNodes(toDelete []*corev1.Node) error {

which in turn calls the markMachineForDeletion function

func (n *node) markMachineForDeletion(ng *nodeGroup) error {

This function sets an annotation on the machines.cluster.x-k8s.io object in the Rancher management cluster corresponding to the machine scheduled for deletion. Afterwards, in each iteration, the node group size is reduced by one via the setSize function:

func (ng *nodeGroup) setSize(size int) error {

The setSize function operates on the clusters.provisioning.cattle.io object in the management cluster, which contains a list of machine pools, each with a quantity field governing the number of machines in the pool:

machinePools[i].Quantity = pointer.Int32Ptr(int32(size))
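
For illustration, here is a minimal, self-contained sketch of what such an update amounts to, written against the dynamic client and the spec.rkeConfig.machinePools layout of the provisioning cluster object as we understand it. The function name, client wiring and error messages are our own placeholders, not the provider's code:

// Illustrative only: a minimal version of the quantity update on the
// clusters.provisioning.cattle.io object, using the dynamic client.
package illustration

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var provisioningClusterGVR = schema.GroupVersionResource{
	Group: "provisioning.cattle.io", Version: "v1", Resource: "clusters",
}

// setMachinePoolQuantity overwrites the quantity of the named machine pool.
// Note that nothing here (or in the caller) checks whether a previous,
// still-unreconciled update already lowered the quantity for the same machine.
func setMachinePoolQuantity(ctx context.Context, c dynamic.Interface, namespace, clusterName, poolName string, size int64) error {
	cluster, err := c.Resource(provisioningClusterGVR).Namespace(namespace).Get(ctx, clusterName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	pools, found, err := unstructured.NestedSlice(cluster.Object, "spec", "rkeConfig", "machinePools")
	if err != nil || !found {
		return fmt.Errorf("machine pools not found: %v", err)
	}
	for i, p := range pools {
		if pool, ok := p.(map[string]interface{}); ok && pool["name"] == poolName {
			pool["quantity"] = size
			pools[i] = pool
		}
	}
	if err := unstructured.SetNestedSlice(cluster.Object, pools, "spec", "rkeConfig", "machinePools"); err != nil {
		return err
	}
	_, err = c.Resource(provisioningClusterGVR).Namespace(namespace).Update(ctx, cluster, metav1.UpdateOptions{})
	return err
}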

In our case, since reconciliation of the clusters.provisioning.cattle.io object was stuck on the Rancher side for several minutes, the machine was not deleted and CAS gradually reduced the size of the machine pool until it reached its minimum size, at which point the check

if ng.replicas-len(toDelete) < ng.MinSize() {

stopped it.
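
To make the failure mode easier to follow, here is a deliberately simplified sketch of the scale-down path as we understand it from the log output and the functions quoted above. This is not the actual rancher_nodegroup.go code; the lookup helper is a placeholder:

// Simplified illustration, not the provider's actual code.
func (ng *nodeGroup) DeleteNodes(toDelete []*corev1.Node) error {
	// The only guard compares the desired size against the configured minimum.
	if ng.replicas-len(toDelete) < ng.MinSize() {
		return fmt.Errorf("node group would drop below its minimum size")
	}

	for _, node := range toDelete {
		machine, err := ng.findMachineForNode(node) // placeholder lookup
		if err != nil {
			return err
		}
		// Sets the deletion annotation on the machines.cluster.x-k8s.io object.
		// Nothing records that the same machine was already marked in an
		// earlier loop, so a stuck deletion is marked again roughly every 10s.
		if err := machine.markMachineForDeletion(ng); err != nil {
			return err
		}
	}

	// The desired machine pool quantity is reduced on every call, even when the
	// marked machines are the same ones as in the previous iteration.
	return ng.setSize(ng.replicas - len(toDelete))
}

With a single stuck machine, this means one replica is subtracted from the desired quantity in every CAS loop until the minimum size stops it, which matches the timeline in the log above.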

At some point, Rancher was able to reconcile its clusters.provisioning.cattle.io resource, which by then had been modified down to a quantity of 1, and Rancher deleted all but one node from the machine pool.

In our opinion this is a critical oversight in the implementation of the scale-down process: even in a less extreme scenario, where Rancher fails to delete the node within the time frame between two CAS loops (which run roughly every 10s), CAS would delete more nodes than required. The interface definition of the NodeGroup clearly states that DeleteNodes should wait for the nodes to be deleted.

We are aware that setting the minimum size of the node pool to a value closer to the actual number of nodes would have mitigated the impact somewhat (and for now we have adapted our node groups accordingly as a temporary workaround). However, this should not be required; instead, the DeleteNodes function should be made idempotent in order to prevent such outcomes.

Suggestion:
A suggestion for a fix would be to add a check in the markMachineForDeletion function: if the deletion annotation is already set on the machine object, simply update the timestamp and return a custom error, which could in turn be used to skip the setSize call, as sketched below. If this suggestion is acceptable, we would be happy to provide a corresponding PR.
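
As a rough, hypothetical sketch of this idea (errMachineAlreadyMarked and the annotation handling below are placeholders; an actual PR would of course follow the existing structure of rancher_nodegroup.go):

// Hypothetical sentinel error, not an existing identifier in the provider.
var errMachineAlreadyMarked = errors.New("machine already marked for deletion")

func (n *node) markMachineForDeletion(ng *nodeGroup) error {
	annotations := n.machine.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}

	_, alreadyMarked := annotations[machineDeleteAnnotationKey] // whichever key the provider uses
	annotations[machineDeleteAnnotationKey] = time.Now().UTC().Format(time.RFC3339)
	n.machine.SetAnnotations(annotations)
	// ... update the machine object in the management cluster as before ...

	if alreadyMarked {
		// Already scheduled for deletion in an earlier loop: refresh the
		// timestamp, but tell the caller not to shrink the pool again.
		return errMachineAlreadyMarked
	}
	return nil
}

The caller in DeleteNodes could then exclude already-marked machines from the size reduction:

alreadyMarked := 0
for _, n := range toDelete {
	if err := n.markMachineForDeletion(ng); err != nil {
		if errors.Is(err, errMachineAlreadyMarked) {
			alreadyMarked++
			continue
		}
		return err
	}
}
return ng.setSize(ng.replicas - (len(toDelete) - alreadyMarked))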

How to reproduce it (as minimally and precisely as possible):

Reproducing this error is tricky, since it would require somehow artificially keeping Rancher from reconciling its clusters.provisioning.cattle.io resource. Nevertheless, from a logical standpoint the code clearly shows that this can happen.

Anything else we need to know?:

pvlkov added the kind/bug label Nov 7, 2024

pvlkov commented Nov 7, 2024

/area cluster-autoscaler


pvlkov commented Nov 7, 2024

/area provider/rancher
