OCPBUGS-27509

Autoscaler should scale from zero MachineSets that declare taints


    • Moderate
    • No
    • CLOUD Sprint 248
    • 1
    • False
    • Release note:
      * Previously, the machine autoscaler could not account for any taint set directly on the compute machine set spec due to a parsing error. This could cause undesired scaling behavior when relying on a compute machine set taint to scale from zero. The issue is resolved in this release, and the machine autoscaler can now scale up correctly and identify taints that prevent workloads from scheduling. (link:https://issues.redhat.com/browse/OCPBUGS-27509[*OCPBUGS-27509*])
    • Bug Fix
    • Done

      Description of problem

      When a MachineAutoscaler references a MachineSet that currently has zero Machines and that sets spec.template.spec.taints, the autoscaler fails to deserialize that MachineSet and therefore cannot autoscale it. The autoscaler's deserialization logic should tolerate the presence of taints instead of failing on them.
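
      For illustration, the problematic field is the taint list nested in the MachineSet's Machine template, for example (other fields omitted; the taint values are the ones used in the reproduction below):

      apiVersion: machine.openshift.io/v1beta1
      kind: MachineSet
      spec:
        template:
          spec:
            taints:
            - effect: NoSchedule
              key: node-role.kubernetes.io/ci
              value: ci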

      Version-Release number of selected component

      Reproduced on 4.14.10 and 4.16.0-ec.1. Based on code inspection, every release back to at least 4.12 is expected to be affected.

      How reproducible

      Always.

      Steps to Reproduce

      With a 4.14.10 GCP cluster launched via Cluster Bot (launch 4.14.10 gcp):

      $ oc adm upgrade
      Cluster version is 4.14.10
      
      Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
      Channel: candidate-4.14 (available channels: candidate-4.14, candidate-4.15)
      No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
      $ oc -n openshift-machine-api get machinesets.machine.openshift.io
      NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
      ci-ln-s48f02k-72292-5z2hn-worker-a   1         1         1       1           29m
      ci-ln-s48f02k-72292-5z2hn-worker-b   1         1         1       1           29m
      ci-ln-s48f02k-72292-5z2hn-worker-c   1         1         1       1           29m
      ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             29m
      

      Pick that set with 0 nodes. They don't come with taints by default:

      $ oc -n openshift-machine-api get -o json machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f | jq '.spec.template.spec.taints'
      null
      

      So patch one in:

      $ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "add", "path": "/spec/template/spec/taints", "value": [{"effect":"NoSchedule","key":"node-role.kubernetes.io/ci","value":"ci"}]}]'
      machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched
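
      To confirm the patch landed, the same jq query as above should now report the taint (output reconstructed for illustration):

      $ oc -n openshift-machine-api get -o json machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f | jq '.spec.template.spec.taints'
      [
        {
          "key": "node-role.kubernetes.io/ci",
          "value": "ci",
          "effect": "NoSchedule"
        }
      ]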
      

      And set up autoscaling:

      $ cat cluster-autoscaler.yaml
      apiVersion: autoscaling.openshift.io/v1
      kind: ClusterAutoscaler
      metadata:
        name: default
      spec:
        maxNodeProvisionTime: 30m
        scaleDown:
          enabled: true
      $ oc apply -f cluster-autoscaler.yaml 
      clusterautoscaler.autoscaling.openshift.io/default created
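
      The cluster-autoscaler-operator should then deploy the actual cluster-autoscaler; a quick sanity check (I believe the deployment is named cluster-autoscaler-<ClusterAutoscaler name>, so cluster-autoscaler-default here; the READY/AGE values below are illustrative):

      $ oc -n openshift-machine-api get deployments | grep autoscaler
      cluster-autoscaler-default    1/1     1            1           1m
      cluster-autoscaler-operator   1/1     1            1           55m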
      

      I'm not all that familiar with autoscaling. Maybe the ClusterAutoscaler doesn't matter on its own, and what's actually needed is a MachineAutoscaler aimed at the chosen MachineSet?

      $ cat machine-autoscaler.yaml 
      apiVersion: autoscaling.openshift.io/v1beta1
      kind: MachineAutoscaler
      metadata:
        name: test
        namespace: openshift-machine-api
      spec:
        maxReplicas: 2
        minReplicas: 1
        scaleTargetRef:
          apiVersion: machine.openshift.io/v1beta1
          kind: MachineSet
          name: ci-ln-s48f02k-72292-5z2hn-worker-f
      $ oc apply -f machine-autoscaler.yaml 
      machineautoscaler.autoscaling.openshift.io/test created
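
      As a cross-check, the cluster-autoscaler-operator is expected to record the MachineAutoscaler bounds as annotations on the target MachineSet (annotation keys from memory, output abridged and illustrative):

      $ oc -n openshift-machine-api get -o json machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f | jq '.metadata.annotations'
      {
        "machine.openshift.io/cluster-api-autoscaler-node-group-max-size": "2",
        "machine.openshift.io/cluster-api-autoscaler-node-group-min-size": "1",
        ...
      }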
      

      Checking the autoscaler's logs:

      $ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail -1 | grep taint
      W0122 19:18:47.246369       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      W0122 19:18:58.474000       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      W0122 19:19:09.703748       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      W0122 19:19:20.929617       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      ...
      

      And the MachineSet is failing to scale:

      $ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f
      NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
      ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             50m
      

      While if I remove the taint:

      $ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "remove", "path": "/spec/template/spec/taints"}]'
      machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched
      

      The autoscaler... well, it's not scaling up new Machines like I'd expected (the logs below show no unschedulable pods to drive a scale-up; see the workload sketch after the logs), but at least it has calmed down about the taint deserialization issue:

      $ oc -n openshift-machine-api get machines.machine.openshift.io
      NAME                                       PHASE     TYPE                REGION        ZONE            AGE
      ci-ln-s48f02k-72292-5z2hn-master-0         Running   e2-custom-6-16384   us-central1   us-central1-a   53m
      ci-ln-s48f02k-72292-5z2hn-master-1         Running   e2-custom-6-16384   us-central1   us-central1-b   53m
      ci-ln-s48f02k-72292-5z2hn-master-2         Running   e2-custom-6-16384   us-central1   us-central1-c   53m
      ci-ln-s48f02k-72292-5z2hn-worker-a-fwskf   Running   e2-standard-4       us-central1   us-central1-a   45m
      ci-ln-s48f02k-72292-5z2hn-worker-b-qkwlt   Running   e2-standard-4       us-central1   us-central1-b   45m
      ci-ln-s48f02k-72292-5z2hn-worker-c-rlw4m   Running   e2-standard-4       us-central1   us-central1-c   45m
      $ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f
      NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
      ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             53m
      $ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail 50
      I0122 19:23:17.284762       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:23:17.687036       1 legacy.go:296] No candidates for scale down
      W0122 19:23:27.924167       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      I0122 19:23:28.510701       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:23:28.909507       1 legacy.go:296] No candidates for scale down
      W0122 19:23:39.148266       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      I0122 19:23:39.737359       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:23:40.135580       1 legacy.go:296] No candidates for scale down
      W0122 19:23:50.376616       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      I0122 19:23:50.963064       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:23:51.364313       1 legacy.go:296] No candidates for scale down
      W0122 19:24:01.601764       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      I0122 19:24:02.191330       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:02.589766       1 legacy.go:296] No candidates for scale down
      I0122 19:24:13.415183       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:13.815851       1 legacy.go:296] No candidates for scale down
      I0122 19:24:24.641190       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:25.040894       1 legacy.go:296] No candidates for scale down
      I0122 19:24:35.867194       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:36.266400       1 legacy.go:296] No candidates for scale down
      I0122 19:24:47.097656       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:47.498099       1 legacy.go:296] No candidates for scale down
      I0122 19:24:58.326025       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:58.726034       1 legacy.go:296] No candidates for scale down
      I0122 19:25:04.927980       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
      I0122 19:25:04.938213       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 10.036399ms
      I0122 19:25:09.552086       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:25:09.952094       1 legacy.go:296] No candidates for scale down
      I0122 19:25:20.778317       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:25:21.178062       1 legacy.go:296] No candidates for scale down
      I0122 19:25:32.005246       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:25:32.404966       1 legacy.go:296] No candidates for scale down
      I0122 19:25:43.233637       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:25:43.633889       1 legacy.go:296] No candidates for scale down
      I0122 19:25:54.462009       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:25:54.861513       1 legacy.go:296] No candidates for scale down
      I0122 19:26:05.688410       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:26:06.088972       1 legacy.go:296] No candidates for scale down
      I0122 19:26:16.915156       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:26:17.315987       1 legacy.go:296] No candidates for scale down
      I0122 19:26:28.143877       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:26:28.543998       1 legacy.go:296] No candidates for scale down
      I0122 19:26:39.369085       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:26:39.770386       1 legacy.go:296] No candidates for scale down
      I0122 19:26:50.596923       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:26:50.997262       1 legacy.go:296] No candidates for scale down
      I0122 19:27:01.823577       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:27:02.223290       1 legacy.go:296] No candidates for scale down
      I0122 19:27:04.938943       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
      I0122 19:27:04.947353       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 8.319938ms
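
      For completeness: the repeated "No unschedulable pods" lines above are why nothing scales up even once the taint is gone; the autoscaler only adds Machines when pods are pending. Something along the lines of the following (names are hypothetical, the toleration matches the taint patched in earlier, and the CPU request is sized so the pods should not fit on the existing e2-standard-4 workers) would be one way to generate pending pods and actually exercise scale-from-zero; with the taint re-added to the MachineSet, the deserialization failure above should then block the scale-up:

      $ cat ci-workload.yaml
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: ci-workload
        namespace: default
      spec:
        replicas: 2
        selector:
          matchLabels:
            app: ci-workload
        template:
          metadata:
            labels:
              app: ci-workload
          spec:
            tolerations:
            - key: node-role.kubernetes.io/ci
              value: ci
              effect: NoSchedule
            containers:
            - name: pause
              image: registry.k8s.io/pause:3.9
              resources:
                requests:
                  cpu: "3"
                  memory: 4Gi
      $ oc apply -f ci-workload.yaml
      deployment.apps/ci-workload created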
      

      Actual results

      Scale-from-zero MachineAutoscaler fails on taint deserialization when the referenced MachineSet contains spec.template.spec.taints.

      Expected results

      Scale-from-zero MachineAutoscaler works, even when the referenced MachineSet contains spec.template.spec.taints.

              Joel Speed
              W. Trevor King
              Zhaohua Sun
              Jeana Routh