
Autoscaler should scale-from zero MachineSets that declare taints


    • Moderate
    • No
    • CLOUD Sprint 248
    • 1
    • False
    • Previously, an error in parsing taints from the `MachineSet` spec meant that the autoscaler could not account for any taint set directly on the spec. Consequently, when scaling from zero relied on the `MachineSet` taints, those taints were not considered, which could cause incorrect scaling decisions. With this update, the parsing issues in the scale-from-zero logic have been resolved. As a result, the autoscaler can now scale up correctly and identify taints that would prevent workloads from scheduling. (link:https://issues.redhat.com/browse/OCPBUGS-27750[*OCPBUGS-27750*])
    • Bug Fix
    • Done

      This is a clone of issue OCPBUGS-27509. The following is the description of the original issue:

      Description of problem

      When a MachineAutoscaler references a currently-zero-Machine MachineSet that includes spec.template.spec.taints, the autoscaler fails to deserialize that MachineSet, which causes it to fail to autoscale that MachineSet. The autoscaler's deserialization logic should be improved to avoid failing on the presence of taints.
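To make the failure mode concrete, here is a minimal Go sketch of one way taints held as unstructured data could be turned into typed values. It is not the autoscaler's actual code: the `Taint` struct is a local stand-in for the Kubernetes core/v1 type, and `taintsFromUnstructured` is a hypothetical helper name. It round-trips the value through JSON so that a list of maps, such as the taint used in the reproduction steps, decodes cleanly.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Taint is a local stand-in for the Kubernetes core/v1 Taint type
// (an assumption, so the sketch has no external dependencies).
type Taint struct {
	Key    string `json:"key"`
	Value  string `json:"value,omitempty"`
	Effect string `json:"effect"`
}

// taintsFromUnstructured (hypothetical helper) converts the loosely typed
// value found at spec.template.spec.taints into typed Taints by
// round-tripping it through JSON. A missing (nil) field yields no taints
// and no error.
func taintsFromUnstructured(raw interface{}) ([]Taint, error) {
	if raw == nil {
		return nil, nil
	}
	data, err := json.Marshal(raw)
	if err != nil {
		return nil, fmt.Errorf("encoding taints: %w", err)
	}
	var taints []Taint
	if err := json.Unmarshal(data, &taints); err != nil {
		return nil, fmt.Errorf("decoding taints: %w", err)
	}
	return taints, nil
}

func main() {
	// The same taint that the reproduction steps patch into the MachineSet.
	raw := []interface{}{
		map[string]interface{}{
			"effect": "NoSchedule",
			"key":    "node-role.kubernetes.io/ci",
			"value":  "ci",
		},
	}
	taints, err := taintsFromUnstructured(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, t := range taints {
		fmt.Printf("%s=%s:%s\n", t.Key, t.Value, t.Effect)
	}
}
```

The JSON round trip is chosen here only because it tolerates any map-shaped input. The real fix may use a different conversion.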

      Version-Release number of selected component

      Reproduced on 4.14.10 and 4.16.0-ec.1. Based on code inspection, this is expected to affect every release going back to at least 4.12.

      How reproducible

      Always.

      Steps to Reproduce

      On a Cluster Bot cluster created with launch 4.14.10 gcp (logs):

      $ oc adm upgrade
      Cluster version is 4.14.10
      
      Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
      Channel: candidate-4.14 (available channels: candidate-4.14, candidate-4.15)
      No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
      $ oc -n openshift-machine-api get machinesets.machine.openshift.io
      NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
      ci-ln-s48f02k-72292-5z2hn-worker-a   1         1         1       1           29m
      ci-ln-s48f02k-72292-5z2hn-worker-b   1         1         1       1           29m
      ci-ln-s48f02k-72292-5z2hn-worker-c   1         1         1       1           29m
      ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             29m
      

      Pick the MachineSet with 0 replicas (worker-f). MachineSets don't come with taints by default:

      $ oc -n openshift-machine-api get -o json machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f | jq '.spec.template.spec.taints'
      null
      

      So patch one in:

      $ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "add", "path": "/spec/template/spec/taints", "value": [{"effect":"NoSchedule","key":"node-role.kubernetes.io/ci","value":"ci"}]}]'
      machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched
      

      And set up autoscaling:

      $ cat cluster-autoscaler.yaml
      apiVersion: autoscaling.openshift.io/v1
      kind: ClusterAutoscaler
      metadata:
        name: default
      spec:
        maxNodeProvisionTime: 30m
        scaleDown:
          enabled: true
      $ oc apply -f cluster-autoscaler.yaml 
      clusterautoscaler.autoscaling.openshift.io/default created
      

      I'm not all that familiar with autoscaling. Maybe the ClusterAutoscaler doesn't matter, and you need a MachineAutoscaler aimed at the chosen MachineSet?

      $ cat machine-autoscaler.yaml 
      apiVersion: autoscaling.openshift.io/v1beta1
      kind: MachineAutoscaler
      metadata:
        name: test
        namespace: openshift-machine-api
      spec:
        maxReplicas: 2
        minReplicas: 1
        scaleTargetRef:
          apiVersion: machine.openshift.io/v1beta1
          kind: MachineSet
          name: ci-ln-s48f02k-72292-5z2hn-worker-f
      $ oc apply -f machine-autoscaler.yaml 
      machineautoscaler.autoscaling.openshift.io/test created
      

      Checking the autoscaler's logs:

      $ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail -1 | grep taint
      W0122 19:18:47.246369       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      W0122 19:18:58.474000       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      W0122 19:19:09.703748       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      W0122 19:19:20.929617       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      ...
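
The literal `%v` printed in front of the map in these warnings may be a separate, cosmetic problem. One possible explanation, offered as a guess rather than a confirmed reading of the autoscaler source, is that the message is passed to a Print-style logging call that does not interpret format verbs. The Go sketch below reproduces the same text using only `fmt`; the `render` and `renderf` helpers are hypothetical names.

```go
package main

import "fmt"

// render mimics a Print-style call: the arguments are concatenated and
// format verbs inside the message are NOT interpreted.
func render(msg string, arg interface{}) string {
	return fmt.Sprint(msg, arg)
}

// renderf mimics a Printf-style call, which does interpret format verbs.
func renderf(format string, arg interface{}) string {
	return fmt.Sprintf(format, arg)
}

func main() {
	taint := map[string]interface{}{
		"effect": "NoSchedule",
		"key":    "node-role.kubernetes.io/ci",
		"value":  "ci",
	}
	msg := "Unable to convert data to taint: %v"

	// Print-style: the verb is emitted literally, followed by the map.
	fmt.Println(render(msg, taint))
	// Printf-style: the verb is replaced by the map.
	fmt.Println(renderf(msg, taint))
}
```

The first line printed matches the warning text in the log. The second shows what a Printf-style call would have produced.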
      

      And the MachineSet is failing to scale:

      $ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f
      NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
      ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             50m
      

      While if I remove the taint:

      $ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "remove", "path": "/spec/template/spec/taints"}]'
      machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched
      

      The autoscaler... well, it's not scaling up new Machines like I'd expected, but at least it seems to have calmed down about the taint deserialization issue:

      $ oc -n openshift-machine-api get machines.machine.openshift.io
      NAME                                       PHASE     TYPE                REGION        ZONE            AGE
      ci-ln-s48f02k-72292-5z2hn-master-0         Running   e2-custom-6-16384   us-central1   us-central1-a   53m
      ci-ln-s48f02k-72292-5z2hn-master-1         Running   e2-custom-6-16384   us-central1   us-central1-b   53m
      ci-ln-s48f02k-72292-5z2hn-master-2         Running   e2-custom-6-16384   us-central1   us-central1-c   53m
      ci-ln-s48f02k-72292-5z2hn-worker-a-fwskf   Running   e2-standard-4       us-central1   us-central1-a   45m
      ci-ln-s48f02k-72292-5z2hn-worker-b-qkwlt   Running   e2-standard-4       us-central1   us-central1-b   45m
      ci-ln-s48f02k-72292-5z2hn-worker-c-rlw4m   Running   e2-standard-4       us-central1   us-central1-c   45m
      $ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f
      NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
      ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             53m
      $ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail 50
      I0122 19:23:17.284762       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:23:17.687036       1 legacy.go:296] No candidates for scale down
      W0122 19:23:27.924167       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      I0122 19:23:28.510701       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:23:28.909507       1 legacy.go:296] No candidates for scale down
      W0122 19:23:39.148266       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      I0122 19:23:39.737359       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:23:40.135580       1 legacy.go:296] No candidates for scale down
      W0122 19:23:50.376616       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      I0122 19:23:50.963064       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:23:51.364313       1 legacy.go:296] No candidates for scale down
      W0122 19:24:01.601764       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      I0122 19:24:02.191330       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:02.589766       1 legacy.go:296] No candidates for scale down
      I0122 19:24:13.415183       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:13.815851       1 legacy.go:296] No candidates for scale down
      I0122 19:24:24.641190       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:25.040894       1 legacy.go:296] No candidates for scale down
      I0122 19:24:35.867194       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:36.266400       1 legacy.go:296] No candidates for scale down
      I0122 19:24:47.097656       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:47.498099       1 legacy.go:296] No candidates for scale down
      I0122 19:24:58.326025       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:58.726034       1 legacy.go:296] No candidates for scale down
      I0122 19:25:04.927980       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
      I0122 19:25:04.938213       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 10.036399ms
      I0122 19:25:09.552086       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:25:09.952094       1 legacy.go:296] No candidates for scale down
      I0122 19:25:20.778317       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:25:21.178062       1 legacy.go:296] No candidates for scale down
      I0122 19:25:32.005246       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:25:32.404966       1 legacy.go:296] No candidates for scale down
      I0122 19:25:43.233637       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:25:43.633889       1 legacy.go:296] No candidates for scale down
      I0122 19:25:54.462009       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:25:54.861513       1 legacy.go:296] No candidates for scale down
      I0122 19:26:05.688410       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:26:06.088972       1 legacy.go:296] No candidates for scale down
      I0122 19:26:16.915156       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:26:17.315987       1 legacy.go:296] No candidates for scale down
      I0122 19:26:28.143877       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:26:28.543998       1 legacy.go:296] No candidates for scale down
      I0122 19:26:39.369085       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:26:39.770386       1 legacy.go:296] No candidates for scale down
      I0122 19:26:50.596923       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:26:50.997262       1 legacy.go:296] No candidates for scale down
      I0122 19:27:01.823577       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:27:02.223290       1 legacy.go:296] No candidates for scale down
      I0122 19:27:04.938943       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
      I0122 19:27:04.947353       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 8.319938ms
      

      Actual results

      Scale-from-zero MachineAutoscaler fails on taint-deserialization when the referenced MachineSet contains spec.template.spec.taints.

      Expected results

      Scale-from-zero MachineAutoscaler works, even when the referenced MachineSet contains spec.template.spec.taints.
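
Once the taints are readable, the scale-from-zero decision depends on whether a pending pod tolerates them. The Go sketch below is a simplified illustration of the standard Kubernetes taint and toleration matching rules. It uses local stand-in types and hypothetical function names, and is not the autoscaler's implementation.

```go
package main

import "fmt"

// Taint and Toleration are simplified local stand-ins for the
// Kubernetes core/v1 types (an assumption for this sketch).
type Taint struct{ Key, Value, Effect string }

type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates reports whether a single toleration matches a single taint.
// An empty toleration effect matches every effect; the "Exists" operator
// ignores the value, and an empty key with "Exists" matches every key.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Operator == "Exists" {
		return tol.Key == "" || tol.Key == t.Key
	}
	// "Equal" is the default operator: key and value must both match.
	return tol.Key == t.Key && tol.Value == t.Value
}

// canSchedule reports whether a pod with the given tolerations could land
// on a node carrying the given taints. Only NoSchedule and NoExecute
// taints block scheduling; PreferNoSchedule is a soft preference.
func canSchedule(taints []Taint, tolerations []Toleration) bool {
	for _, t := range taints {
		if t.Effect != "NoSchedule" && t.Effect != "NoExecute" {
			continue
		}
		tolerated := false
		for _, tol := range tolerations {
			if tolerates(tol, t) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	// The taint from the reproduction steps.
	taints := []Taint{{Key: "node-role.kubernetes.io/ci", Value: "ci", Effect: "NoSchedule"}}

	// A pod without tolerations cannot use nodes from this MachineSet.
	fmt.Println(canSchedule(taints, nil))

	// A pod that tolerates the taint can, so scaling up would help it.
	tol := []Toleration{{Key: "node-role.kubernetes.io/ci", Operator: "Equal", Value: "ci", Effect: "NoSchedule"}}
	fmt.Println(canSchedule(taints, tol))
}
```

Under these rules, scaling the tainted MachineSet up only helps pods that carry a matching toleration, which is why the autoscaler needs to read the taints correctly before deciding.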

              joelspeed Joel Speed
              openshift-crt-jira-prow OpenShift Prow Bot
              Zhaohua Sun Zhaohua Sun