OpenShift Bugs / OCPBUGS-26608

cluster-autoscaler possibly has a stale cache on machinesets that can be scaled up

      Description of problem:

      The cluster-autoscaler logs reported "No expansion options" even though hundreds of pods were unschedulable:
      ❯ k logs -n openshift-machine-api cluster-autoscaler-default-79bbdb47c8-f2mhw --tail 10 -f
      I0110 15:22:34.974683       1 klogx.go:87] Pod ci-op-7y3wpmbd/dpdk-amd64-build is unschedulable
      ...
      I0110 15:22:34.974714       1 klogx.go:87] Pod ci-op-vh6l7s82/tests-private-amd64-build is unschedulable
      I0110 15:22:34.974795       1 klogx.go:87] 338 other pods are also unschedulable
      I0110 15:22:38.028472       1 orchestrator.go:168] No expansion options
      I0110 15:22:38.949588       1 eligibility.go:102] Scale-down calculation: ignoring 10 nodes unremovable in the last 5m0s
      I0110 15:22:38.949691       1 legacy.go:193] 1 nodes found to be unremovable in simulation, will re-check them at 2024-01-10 15:27:22.940617125 +0000 UTC m=+1529958.772155039
      I0110 15:22:39.944889       1 legacy.go:296] No candidates for scale down
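The mismatch in the excerpt above (many unschedulable pods, yet no expansion options) can be detected mechanically. The following is a minimal sketch, not part of the autoscaler: the function name `summarize` and the regular expressions are assumptions derived only from the log lines quoted in this report, and other autoscaler versions may word these messages differently.

```python
import re

# Patterns based solely on the log lines quoted in this report (assumption).
POD_RE = re.compile(r"Pod (\S+) is unschedulable")
OTHERS_RE = re.compile(r"(\d+) other pods are also unschedulable")
NO_OPTIONS = "No expansion options"


def summarize(lines):
    """Return (listed_pods, additional_count, saw_no_expansion_options)."""
    listed = []
    additional = 0
    no_options = False
    for line in lines:
        m = POD_RE.search(line)
        if m:
            listed.append(m.group(1))
            continue
        m = OTHERS_RE.search(line)
        if m:
            additional += int(m.group(1))
            continue
        if NO_OPTIONS in line:
            no_options = True
    return listed, additional, no_options


SAMPLE = [
    "I0110 15:22:34.974683       1 klogx.go:87] Pod ci-op-7y3wpmbd/dpdk-amd64-build is unschedulable",
    "I0110 15:22:34.974714       1 klogx.go:87] Pod ci-op-vh6l7s82/tests-private-amd64-build is unschedulable",
    "I0110 15:22:34.974795       1 klogx.go:87] 338 other pods are also unschedulable",
    "I0110 15:22:38.028472       1 orchestrator.go:168] No expansion options",
]

pods, extra, stuck = summarize(SAMPLE)
if stuck and (pods or extra):
    print(f"suspicious: at least {len(pods) + extra} unschedulable pods but no expansion options")
```

The sample contains only the lines visible in the excerpt; the elided lines ("...") are not reconstructed.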

      However, after the cluster-autoscaler pod was restarted, it found that there were in fact expansion options and scaled up.

      Version-Release number of selected component (if applicable):

      4.14.7

      How reproducible:

      Unsure

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

      Cluster autoscaler status and machinesets at the time it believed there were no expansion options available:
      
      ❯ k get cm -n openshift-machine-api cluster-autoscaler-status -oyaml
      apiVersion: v1
      data:
        status: |+
          Cluster-autoscaler status at 2024-01-10 15:53:01.081693035 +0000 UTC:
          Cluster-wide:
            Health:      Healthy (ready=31 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=31 longUnregistered=0)
                         LastProbeTime:      2024-01-10 15:52:43.077610751 +0000 UTC m=+1531478.909148680
                         LastTransitionTime: 2023-12-23 22:28:17.53635074 +0000 UTC m=+13.367888721
            ScaleUp:     NoActivity (ready=31 registered=31)
                         LastProbeTime:      2024-01-10 15:52:43.077610751 +0000 UTC m=+1531478.909148680
                         LastTransitionTime: 2024-01-10 14:34:06.135121145 +0000 UTC m=+1526761.966659046
            ScaleDown:   NoCandidates (candidates=0)
                         LastProbeTime:      2024-01-10 15:52:43.077610751 +0000 UTC m=+1531478.909148680
                         LastTransitionTime: 2024-01-10 15:09:42.49780205 +0000 UTC m=+1528898.329339964
      
          NodeGroups:
            Name:        MachineSet/openshift-machine-api/build05-kwk66-ci-builds-worker-us-east-1a
            Health:      Healthy (ready=0 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=0 longUnregistered=0 cloudProviderTarget=0 (minSize=0, maxSize=120))
                         LastProbeTime:      0001-01-01 00:00:00 +0000 UTC
                         LastTransitionTime: 2023-12-23 22:28:17.53635074 +0000 UTC m=+13.367888721
            ScaleUp:     NoActivity (ready=0 cloudProviderTarget=0)
                         LastProbeTime:      0001-01-01 00:00:00 +0000 UTC
                         LastTransitionTime: 2024-01-02 05:13:04.845376598 +0000 UTC m=+801900.676914511
            ScaleDown:   NoCandidates (candidates=0)
                         LastProbeTime:      2024-01-10 15:52:43.077610751 +0000 UTC m=+1531478.909148680
                         LastTransitionTime: 2024-01-02 05:42:03.12588419 +0000 UTC m=+803638.957422102
      
            Name:        MachineSet/openshift-machine-api/build05-kwk66-ci-longtests-worker-us-east-1a
            Health:      Healthy (ready=1 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=1 longUnregistered=0 cloudProviderTarget=1 (minSize=0, maxSize=120))
                         LastProbeTime:      2024-01-10 15:52:43.077610751 +0000 UTC m=+1531478.909148680
                         LastTransitionTime: 2023-12-27 06:25:40.633849606 +0000 UTC m=+287856.465387596
            ScaleUp:     NoActivity (ready=1 cloudProviderTarget=1)
                         LastProbeTime:      2024-01-10 15:52:43.077610751 +0000 UTC m=+1531478.909148680
                         LastTransitionTime: 2024-01-10 10:20:10.643142528 +0000 UTC m=+1511526.474680441
            ScaleDown:   NoCandidates (candidates=0)
                         LastProbeTime:      2024-01-10 15:52:43.077610751 +0000 UTC m=+1531478.909148680
                         LastTransitionTime: 2024-01-10 15:09:42.49780205 +0000 UTC m=+1528898.329339964
      
            Name:        MachineSet/openshift-machine-api/build05-kwk66-ci-prowjobs-worker-us-east-1a
            Health:      Healthy (ready=4 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=4 longUnregistered=0 cloudProviderTarget=4 (minSize=0, maxSize=120))
                         LastProbeTime:      2024-01-10 15:52:43.077610751 +0000 UTC m=+1531478.909148680
                         LastTransitionTime: 2023-12-23 22:28:17.53635074 +0000 UTC m=+13.367888721
            ScaleUp:     NoActivity (ready=4 cloudProviderTarget=4)
                         LastProbeTime:      2024-01-10 15:52:43.077610751 +0000 UTC m=+1531478.909148680
                         LastTransitionTime: 2024-01-10 09:54:01.373951307 +0000 UTC m=+1509957.205489207
            ScaleDown:   NoCandidates (candidates=0)
                         LastProbeTime:      2024-01-10 15:52:43.077610751 +0000 UTC m=+1531478.909148680
                         LastTransitionTime: 2024-01-10 09:54:27.396549022 +0000 UTC m=+1509983.228086936
      
            Name:        MachineSet/openshift-machine-api/build05-kwk66-ci-tests-worker-us-east-1a
            Health:      Healthy (ready=12 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=12 longUnregistered=0 cloudProviderTarget=12 (minSize=0, maxSize=120))
                         LastProbeTime:      2024-01-10 15:52:43.077610751 +0000 UTC m=+1531478.909148680
                         LastTransitionTime: 2023-12-23 22:28:17.53635074 +0000 UTC m=+13.367888721
            ScaleUp:     NoActivity (ready=12 cloudProviderTarget=12)
                         LastProbeTime:      2024-01-10 15:52:43.077610751 +0000 UTC m=+1531478.909148680
                         LastTransitionTime: 2024-01-10 11:39:04.279432634 +0000 UTC m=+1516260.110970547
            ScaleDown:   NoCandidates (candidates=0)
                         LastProbeTime:      2024-01-10 15:52:43.077610751 +0000 UTC m=+1531478.909148680
                         LastTransitionTime: 2024-01-10 15:07:00.394322553 +0000 UTC m=+1528736.225860467
      
            Name:        MachineSet/openshift-machine-api/build05-kwk66-worker-us-east-1a
            Health:      Healthy (ready=3 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=3 longUnregistered=0 cloudProviderTarget=3 (minSize=2, maxSize=50))
                         LastProbeTime:      2024-01-10 15:52:43.077610751 +0000 UTC m=+1531478.909148680
                         LastTransitionTime: 2023-12-23 22:28:17.53635074 +0000 UTC m=+13.367888721
            ScaleUp:     NoActivity (ready=3 cloudProviderTarget=3)
                         LastProbeTime:      2024-01-10 15:52:43.077610751 +0000 UTC m=+1531478.909148680
                         LastTransitionTime: 2024-01-10 14:34:06.135121145 +0000 UTC m=+1526761.966659046
            ScaleDown:   NoCandidates (candidates=0)
                         LastProbeTime:      2024-01-10 15:52:43.077610751 +0000 UTC m=+1531478.909148680
                         LastTransitionTime: 2024-01-10 14:46:09.552677149 +0000 UTC m=+1527485.384215063
      
      kind: ConfigMap
      
      ❯ k get machineset -A                                                                       
      NAMESPACE               NAME                                           DESIRED   CURRENT   READY   AVAILABLE   AGE
      openshift-machine-api   build05-kwk66-ci-builds-worker-us-east-1a      0         0                             624d
      openshift-machine-api   build05-kwk66-ci-longtests-worker-us-east-1a   1         1         1       1           605d
      openshift-machine-api   build05-kwk66-ci-prowjobs-worker-us-east-1a    4         4         4       4           605d
      openshift-machine-api   build05-kwk66-ci-tests-worker-us-east-1a       12        12        12      12          624d
      openshift-machine-api   build05-kwk66-infra-us-east-1a                 2         2         2       2           112d
      openshift-machine-api   build05-kwk66-worker-us-east-1a                3         3         3       3           629d
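The status output above shows that several node groups were well below their maximum size when the autoscaler reported no expansion options. As a rough sketch, the remaining headroom per node group can be computed from the status text. The function name `headroom` and the regular expressions are assumptions based on the format shown in this report. Headroom alone does not prove that a scale-up was possible (the pending pods must also fit the node group's template), but it does rule out `maxSize` as the limiting factor.

```python
import re

# Matches the "Name:" line and the size fields on the node group "Health:" line
# as they appear in the status text above (format is an assumption beyond that).
NAME_RE = re.compile(r"Name:\s+(\S+)")
SIZE_RE = re.compile(r"cloudProviderTarget=(\d+) \(minSize=(\d+), maxSize=(\d+)\)")


def headroom(status_text):
    """Map each node group name to maxSize - cloudProviderTarget."""
    result = {}
    current = None
    for line in status_text.splitlines():
        m = NAME_RE.search(line)
        if m:
            current = m.group(1)
            continue
        m = SIZE_RE.search(line)
        if m and current is not None:
            target, _min_size, max_size = map(int, m.groups())
            result[current] = max_size - target
    return result


# Two node groups abbreviated from the status shown above.
STATUS = """\
Name:        MachineSet/openshift-machine-api/build05-kwk66-ci-builds-worker-us-east-1a
Health:      Healthy (ready=0 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=0 longUnregistered=0 cloudProviderTarget=0 (minSize=0, maxSize=120))
ScaleUp:     NoActivity (ready=0 cloudProviderTarget=0)
Name:        MachineSet/openshift-machine-api/build05-kwk66-worker-us-east-1a
Health:      Healthy (ready=3 unready=0 (resourceUnready=0) notStarted=0 longNotStarted=0 registered=3 longUnregistered=0 cloudProviderTarget=3 (minSize=2, maxSize=50))
ScaleUp:     NoActivity (ready=3 cloudProviderTarget=3)
"""

for name, free in headroom(STATUS).items():
    print(f"{name}: room for {free} more nodes")
```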

      Expected results:

      After the cluster-autoscaler pod was restarted, it behaved as expected and scaled up the ci-builds MachineSet:
      
      I0110 16:08:56.001512       1 orchestrator.go:189] Best option to resize: MachineSet/openshift-machine-api/build05-kwk66-ci-builds-worker-us-east-1a
      I0110 16:08:56.001531       1 orchestrator.go:193] Estimated 138 nodes needed in MachineSet/openshift-machine-api/build05-kwk66-ci-builds-worker-us-east-1a
      I0110 16:08:56.844729       1 orchestrator.go:302] Final scale-up plan: [{MachineSet/openshift-machine-api/build05-kwk66-ci-builds-worker-us-east-1a 0->120 (max: 120)}]
      I0110 16:08:56.844772       1 orchestrator.go:584] Scale-up: setting group MachineSet/openshift-machine-api/build05-kwk66-ci-builds-worker-us-east-1a size to 120    
      
      ❯ k get machineset -A                                                                                             
      NAMESPACE               NAME                                           DESIRED   CURRENT   READY   AVAILABLE   AGE
      openshift-machine-api   build05-kwk66-ci-builds-worker-us-east-1a      120       120                           624d
      openshift-machine-api   build05-kwk66-ci-longtests-worker-us-east-1a   1         1         1       1           605d
      openshift-machine-api   build05-kwk66-ci-prowjobs-worker-us-east-1a    4         4         4       4           605d
      openshift-machine-api   build05-kwk66-ci-tests-worker-us-east-1a       12        12        12      12          624d
      openshift-machine-api   build05-kwk66-infra-us-east-1a                 2         2         2       2           112d
      openshift-machine-api   build05-kwk66-worker-us-east-1a                3         3         3       3           629d
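The post-restart logs estimate 138 nodes but plan a resize of only 0->120, which is consistent with the estimate being capped at the MachineSet's maximum size. A small sketch of that arithmetic follows. The helper `planned_size` is hypothetical and only illustrates the capping visible in the log line; the real autoscaler applies further limits (for example cluster-wide resource limits) that are not modeled here.

```python
def planned_size(current, estimated_new_nodes, max_size):
    """Clamp the desired node group size to its configured maximum."""
    return min(current + estimated_new_nodes, max_size)


# Values taken from the log lines above: current size 0, estimate 138, max 120.
print(planned_size(0, 138, 120))  # 120
```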

      Additional info:

            joelspeed Joel Speed
            mshen.openshift Michael Shen
            Zhaohua Sun Zhaohua Sun