OCPBUGS-27509

Autoscaler should scale from zero MachineSets that declare taints


    • Moderate
    • No
    • CLOUD Sprint 248
    • 1
    • False
    • Release note:
      * Previously, the machine autoscaler could not account for any taint set directly on the compute machine set spec due to a parsing error. This could cause undesired scaling behavior when relying on a compute machine set taint to scale from zero. The issue is resolved in this release, and the machine autoscaler can now scale up correctly and identify taints that prevent workloads from scheduling. (link:https://issues.redhat.com/browse/OCPBUGS-27509[*OCPBUGS-27509*])
    • Bug Fix
    • Done

      Description of problem

      When a MachineAutoscaler references a MachineSet that currently has zero Machines and that sets spec.template.spec.taints, the autoscaler fails to deserialize that MachineSet and therefore cannot autoscale it. The autoscaler's deserialization logic should tolerate the presence of taints instead of failing on them.
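
      For illustration, the problematic field is the taint list nested in the MachineSet's Machine template, for example (other fields omitted; the taint values are the ones used in the reproduction below):

      apiVersion: machine.openshift.io/v1beta1
      kind: MachineSet
      spec:
        template:
          spec:
            taints:
            - effect: NoSchedule
              key: node-role.kubernetes.io/ci
              value: ci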

      Version-Release number of selected component

      Reproduced on 4.14.10 and 4.16.0-ec.1. Based on code inspection, every release back to at least 4.12 is expected to be affected.

      How reproducible

      Always.

      Steps to Reproduce

      With a 4.14.10 GCP cluster launched via Cluster Bot (launch 4.14.10 gcp):

      $ oc adm upgrade
      Cluster version is 4.14.10
      
      Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
      Channel: candidate-4.14 (available channels: candidate-4.14, candidate-4.15)
      No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
      $ oc -n openshift-machine-api get machinesets.machine.openshift.io
      NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
      ci-ln-s48f02k-72292-5z2hn-worker-a   1         1         1       1           29m
      ci-ln-s48f02k-72292-5z2hn-worker-b   1         1         1       1           29m
      ci-ln-s48f02k-72292-5z2hn-worker-c   1         1         1       1           29m
      ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             29m
      

      Pick that set with 0 nodes. They don't come with taints by default:

      $ oc -n openshift-machine-api get -o json machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f | jq '.spec.template.spec.taints'
      null
      

      So patch one in:

      $ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "add", "path": "/spec/template/spec/taints", "value": [{"effect":"NoSchedule","key":"node-role.kubernetes.io/ci","value":"ci"}]}]'
      machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched
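
      To confirm the patch landed, the same jq query as above should now report the taint (output reconstructed for illustration):

      $ oc -n openshift-machine-api get -o json machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f | jq '.spec.template.spec.taints'
      [
        {
          "key": "node-role.kubernetes.io/ci",
          "value": "ci",
          "effect": "NoSchedule"
        }
      ]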
      

      And set up autoscaling:

      $ cat cluster-autoscaler.yaml
      apiVersion: autoscaling.openshift.io/v1
      kind: ClusterAutoscaler
      metadata:
        name: default
      spec:
        maxNodeProvisionTime: 30m
        scaleDown:
          enabled: true
      $ oc apply -f cluster-autoscaler.yaml 
      clusterautoscaler.autoscaling.openshift.io/default created
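
      The cluster-autoscaler-operator should then deploy the actual cluster-autoscaler; a quick sanity check (I believe the deployment is named cluster-autoscaler-<ClusterAutoscaler name>, so cluster-autoscaler-default here; the READY/AGE values below are illustrative):

      $ oc -n openshift-machine-api get deployments | grep autoscaler
      cluster-autoscaler-default    1/1     1            1           1m
      cluster-autoscaler-operator   1/1     1            1           55m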
      

      I'm not all that familiar with autoscaling. Maybe the ClusterAutoscaler doesn't matter on its own, and what's actually needed is a MachineAutoscaler aimed at the chosen MachineSet?

      $ cat machine-autoscaler.yaml 
      apiVersion: autoscaling.openshift.io/v1beta1
      kind: MachineAutoscaler
      metadata:
        name: test
        namespace: openshift-machine-api
      spec:
        maxReplicas: 2
        minReplicas: 1
        scaleTargetRef:
          apiVersion: machine.openshift.io/v1beta1
          kind: MachineSet
          name: ci-ln-s48f02k-72292-5z2hn-worker-f
      $ oc apply -f machine-autoscaler.yaml 
      machineautoscaler.autoscaling.openshift.io/test created
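
      As a cross-check, the cluster-autoscaler-operator is expected to record the MachineAutoscaler bounds as annotations on the target MachineSet (annotation keys from memory, output abridged and illustrative):

      $ oc -n openshift-machine-api get -o json machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f | jq '.metadata.annotations'
      {
        "machine.openshift.io/cluster-api-autoscaler-node-group-max-size": "2",
        "machine.openshift.io/cluster-api-autoscaler-node-group-min-size": "1",
        ...
      }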
      

      Checking the autoscaler's logs:

      $ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail -1 | grep taint
      W0122 19:18:47.246369       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      W0122 19:18:58.474000       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      W0122 19:19:09.703748       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      W0122 19:19:20.929617       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      ...
      

      And the MachineSet is failing to scale:

      $ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f
      NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
      ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             50m
      

      While if I remove the taint:

      $ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "remove", "path": "/spec/template/spec/taints"}]'
      machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched
      

      The autoscaler... well, it's not scaling up new Machines like I'd expected (the logs below show no unschedulable pods to drive a scale-up; see the workload sketch after the logs), but at least it has calmed down about the taint deserialization issue:

      $ oc -n openshift-machine-api get machines.machine.openshift.io
      NAME                                       PHASE     TYPE                REGION        ZONE            AGE
      ci-ln-s48f02k-72292-5z2hn-master-0         Running   e2-custom-6-16384   us-central1   us-central1-a   53m
      ci-ln-s48f02k-72292-5z2hn-master-1         Running   e2-custom-6-16384   us-central1   us-central1-b   53m
      ci-ln-s48f02k-72292-5z2hn-master-2         Running   e2-custom-6-16384   us-central1   us-central1-c   53m
      ci-ln-s48f02k-72292-5z2hn-worker-a-fwskf   Running   e2-standard-4       us-central1   us-central1-a   45m
      ci-ln-s48f02k-72292-5z2hn-worker-b-qkwlt   Running   e2-standard-4       us-central1   us-central1-b   45m
      ci-ln-s48f02k-72292-5z2hn-worker-c-rlw4m   Running   e2-standard-4       us-central1   us-central1-c   45m
      $ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f
      NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
      ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             53m
      $ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail 50
      I0122 19:23:17.284762       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:23:17.687036       1 legacy.go:296] No candidates for scale down
      W0122 19:23:27.924167       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      I0122 19:23:28.510701       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:23:28.909507       1 legacy.go:296] No candidates for scale down
      W0122 19:23:39.148266       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      I0122 19:23:39.737359       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:23:40.135580       1 legacy.go:296] No candidates for scale down
      W0122 19:23:50.376616       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      I0122 19:23:50.963064       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:23:51.364313       1 legacy.go:296] No candidates for scale down
      W0122 19:24:01.601764       1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
      I0122 19:24:02.191330       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:02.589766       1 legacy.go:296] No candidates for scale down
      I0122 19:24:13.415183       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:13.815851       1 legacy.go:296] No candidates for scale down
      I0122 19:24:24.641190       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:25.040894       1 legacy.go:296] No candidates for scale down
      I0122 19:24:35.867194       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:36.266400       1 legacy.go:296] No candidates for scale down
      I0122 19:24:47.097656       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:47.498099       1 legacy.go:296] No candidates for scale down
      I0122 19:24:58.326025       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:24:58.726034       1 legacy.go:296] No candidates for scale down
      I0122 19:25:04.927980       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
      I0122 19:25:04.938213       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 10.036399ms
      I0122 19:25:09.552086       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:25:09.952094       1 legacy.go:296] No candidates for scale down
      I0122 19:25:20.778317       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:25:21.178062       1 legacy.go:296] No candidates for scale down
      I0122 19:25:32.005246       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:25:32.404966       1 legacy.go:296] No candidates for scale down
      I0122 19:25:43.233637       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:25:43.633889       1 legacy.go:296] No candidates for scale down
      I0122 19:25:54.462009       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:25:54.861513       1 legacy.go:296] No candidates for scale down
      I0122 19:26:05.688410       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:26:06.088972       1 legacy.go:296] No candidates for scale down
      I0122 19:26:16.915156       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:26:17.315987       1 legacy.go:296] No candidates for scale down
      I0122 19:26:28.143877       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:26:28.543998       1 legacy.go:296] No candidates for scale down
      I0122 19:26:39.369085       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:26:39.770386       1 legacy.go:296] No candidates for scale down
      I0122 19:26:50.596923       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:26:50.997262       1 legacy.go:296] No candidates for scale down
      I0122 19:27:01.823577       1 static_autoscaler.go:552] No unschedulable pods
      I0122 19:27:02.223290       1 legacy.go:296] No candidates for scale down
      I0122 19:27:04.938943       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
      I0122 19:27:04.947353       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 8.319938ms
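
      For completeness: the repeated "No unschedulable pods" lines above are why nothing scales up even once the taint is gone; the autoscaler only adds Machines when pods are pending. Something along the lines of the following (names are hypothetical, the toleration matches the taint patched in earlier, and the CPU request is sized so the pods should not fit on the existing e2-standard-4 workers) would be one way to generate pending pods and actually exercise scale-from-zero; with the taint re-added to the MachineSet, the deserialization failure above should then block the scale-up:

      $ cat ci-workload.yaml
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: ci-workload
        namespace: default
      spec:
        replicas: 2
        selector:
          matchLabels:
            app: ci-workload
        template:
          metadata:
            labels:
              app: ci-workload
          spec:
            tolerations:
            - key: node-role.kubernetes.io/ci
              value: ci
              effect: NoSchedule
            containers:
            - name: pause
              image: registry.k8s.io/pause:3.9
              resources:
                requests:
                  cpu: "3"
                  memory: 4Gi
      $ oc apply -f ci-workload.yaml
      deployment.apps/ci-workload created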
      

      Actual results

      Scale-from-zero MachineAutoscaler fails on taint deserialization when the referenced MachineSet contains spec.template.spec.taints.

      Expected results

      Scale-from-zero MachineAutoscaler works, even when the referenced MachineSet contains spec.template.spec.taints.

              Joel Speed
              W. Trevor King
              Zhaohua Sun
              Jeana Routh