Description of problem
When a MachineAutoscaler references a MachineSet that currently has zero Machines and declares spec.template.spec.taints, the cluster autoscaler fails to deserialize the taints from that MachineSet, and as a result it cannot scale that MachineSet up from zero. The autoscaler's deserialization logic should handle the presence of taints instead of failing.
Version-Release number of selected component
Reproduced on 4.14.10 and 4.16.0-ec.1. Expected to be every release going back to at least 4.12, based on code inspection.
How reproducible
Always.
Steps to Reproduce
Using a Cluster Bot cluster launched with "launch 4.14.10 gcp" (logs):
$ oc adm upgrade
Cluster version is 4.14.10
Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.14 (available channels: candidate-4.14, candidate-4.15)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
$ oc -n openshift-machine-api get machinesets.machine.openshift.io
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-s48f02k-72292-5z2hn-worker-a   1         1         1       1           29m
ci-ln-s48f02k-72292-5z2hn-worker-b   1         1         1       1           29m
ci-ln-s48f02k-72292-5z2hn-worker-c   1         1         1       1           29m
ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             29m
Pick the MachineSet with 0 Machines. MachineSets don't come with taints by default:
$ oc -n openshift-machine-api get -o json machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f | jq '.spec.template.spec.taints'
null
So patch one in:
$ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "add", "path": "/spec/template/spec/taints", "value": [{"effect":"NoSchedule","key":"node-role.kubernetes.io/ci","value":"ci"} ]}]'
machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched
And set up autoscaling:
$ cat cluster-autoscaler.yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  maxNodeProvisionTime: 30m
  scaleDown:
    enabled: true
$ oc apply -f cluster-autoscaler.yaml
clusterautoscaler.autoscaling.openshift.io/default created
I'm not all that familiar with autoscaling; the ClusterAutoscaler may not strictly matter here, but a MachineAutoscaler aimed at the chosen MachineSet is needed:
$ cat machine-autoscaler.yaml
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: test
  namespace: openshift-machine-api
spec:
  maxReplicas: 2
  minReplicas: 1
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: ci-ln-s48f02k-72292-5z2hn-worker-f
$ oc apply -f machine-autoscaler.yaml
machineautoscaler.autoscaling.openshift.io/test created
Checking the autoscaler's logs:
$ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail -1 | grep taint
W0122 19:18:47.246369 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
W0122 19:18:58.474000 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
W0122 19:19:09.703748 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
W0122 19:19:20.929617 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
...
And the MachineSet is failing to scale:
$ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             50m
While if I remove the taint:
$ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "remove", "path": "/spec/template/spec/taints"}]'
machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched
The autoscaler... well, it's not scaling up new Machines like I'd expected, but at least it seems to have calmed down about the taint deserialization issue:
$ oc -n openshift-machine-api get machines.machine.openshift.io
NAME                                       PHASE     TYPE                REGION        ZONE            AGE
ci-ln-s48f02k-72292-5z2hn-master-0         Running   e2-custom-6-16384   us-central1   us-central1-a   53m
ci-ln-s48f02k-72292-5z2hn-master-1         Running   e2-custom-6-16384   us-central1   us-central1-b   53m
ci-ln-s48f02k-72292-5z2hn-master-2         Running   e2-custom-6-16384   us-central1   us-central1-c   53m
ci-ln-s48f02k-72292-5z2hn-worker-a-fwskf   Running   e2-standard-4       us-central1   us-central1-a   45m
ci-ln-s48f02k-72292-5z2hn-worker-b-qkwlt   Running   e2-standard-4       us-central1   us-central1-b   45m
ci-ln-s48f02k-72292-5z2hn-worker-c-rlw4m   Running   e2-standard-4       us-central1   us-central1-c   45m
$ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             53m
$ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail 50
I0122 19:23:17.284762 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:23:17.687036 1 legacy.go:296] No candidates for scale down
W0122 19:23:27.924167 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
I0122 19:23:28.510701 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:23:28.909507 1 legacy.go:296] No candidates for scale down
W0122 19:23:39.148266 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
I0122 19:23:39.737359 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:23:40.135580 1 legacy.go:296] No candidates for scale down
W0122 19:23:50.376616 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
I0122 19:23:50.963064 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:23:51.364313 1 legacy.go:296] No candidates for scale down
W0122 19:24:01.601764 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
I0122 19:24:02.191330 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:02.589766 1 legacy.go:296] No candidates for scale down
I0122 19:24:13.415183 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:13.815851 1 legacy.go:296] No candidates for scale down
I0122 19:24:24.641190 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:25.040894 1 legacy.go:296] No candidates for scale down
I0122 19:24:35.867194 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:36.266400 1 legacy.go:296] No candidates for scale down
I0122 19:24:47.097656 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:47.498099 1 legacy.go:296] No candidates for scale down
I0122 19:24:58.326025 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:58.726034 1 legacy.go:296] No candidates for scale down
I0122 19:25:04.927980 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0122 19:25:04.938213 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 10.036399ms
I0122 19:25:09.552086 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:09.952094 1 legacy.go:296] No candidates for scale down
I0122 19:25:20.778317 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:21.178062 1 legacy.go:296] No candidates for scale down
I0122 19:25:32.005246 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:32.404966 1 legacy.go:296] No candidates for scale down
I0122 19:25:43.233637 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:43.633889 1 legacy.go:296] No candidates for scale down
I0122 19:25:54.462009 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:54.861513 1 legacy.go:296] No candidates for scale down
I0122 19:26:05.688410 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:06.088972 1 legacy.go:296] No candidates for scale down
I0122 19:26:16.915156 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:17.315987 1 legacy.go:296] No candidates for scale down
I0122 19:26:28.143877 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:28.543998 1 legacy.go:296] No candidates for scale down
I0122 19:26:39.369085 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:39.770386 1 legacy.go:296] No candidates for scale down
I0122 19:26:50.596923 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:50.997262 1 legacy.go:296] No candidates for scale down
I0122 19:27:01.823577 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:27:02.223290 1 legacy.go:296] No candidates for scale down
I0122 19:27:04.938943 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0122 19:27:04.947353 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 8.319938ms
Actual results
Scale-from-zero autoscaling fails: when the referenced MachineSet contains spec.template.spec.taints, the autoscaler cannot deserialize the taints and never scales the MachineSet up.
Expected results
Scale-from-zero MachineAutoscaler works, even when the referenced MachineSet contains spec.template.spec.taints.
Linked issues
- blocks: OCPBUGS-27750 "Autoscaler should scale-from zero MachineSets that declare taints" (Closed)
- is cloned by: OCPBUGS-27750 "Autoscaler should scale-from zero MachineSets that declare taints" (Closed)
- links to: RHEA-2024:0041 "OpenShift Container Platform 4.16.z bug fix update"