OpenShift Bugs / OCPBUGS-54326

Autoscaler stops working after a single machineautoscaler enters a failed state

    • Moderate
    • No
    • False
      Previously, the cluster autoscaler could stop scaling because of a failed machine in a machineset. This condition occurred due to inaccuracies in the way the cluster autoscaler counts machines in various non-running phases. Those inaccuracies have now been fixed, giving the cluster autoscaler a more accurate count.

      This is a clone of issue OCPBUGS-11115. The following is the description of the original issue:

      Description of problem:

      On March 22, the autoscaler entered a broken state with the following error:

      2023-03-22T12:46:49.148733289Z E0322 12:46:49.148726       1 static_autoscaler.go:364] Failed to fix node group sizes: failed to decrease MachineSet/openshift-machine-api/eu-3-compute-kgzn2-aro-machineset-compute-xl-germanywestcentral-1: attempt to delete existing nodes targetSize:4 delta:-1 existingNodes: 6
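The numbers in the error above can be read as a guard that refuses to shrink a node group below the number of nodes that already exist. The following is a minimal sketch of such a guard, inferred only from the error text; `checkDecrease` and its parameters are hypothetical names and are not copied from the autoscaler source.

```go
package main

import "fmt"

// checkDecrease is a hypothetical, simplified stand-in for the guard that
// appears to have produced the error in this report. It rejects a decrease
// that would put the target size below the number of existing nodes.
func checkDecrease(targetSize, delta, existingNodes int) error {
	if delta >= 0 {
		return fmt.Errorf("size decrease must be negative, got delta:%d", delta)
	}
	if targetSize+delta < existingNodes {
		return fmt.Errorf(
			"attempt to delete existing nodes targetSize:%d delta:%d existingNodes: %d",
			targetSize, delta, existingNodes)
	}
	return nil
}

func main() {
	// Values from the log line in this report: 4 + (-1) = 3, which is below 6.
	fmt.Println(checkDecrease(4, -1, 6))
	// A decrease that stays at or above the existing node count is accepted.
	fmt.Println(checkDecrease(7, -1, 6))
}
```

With the values from the log line (targetSize 4, delta -1, 6 existing nodes), the resulting size of 3 is below the 6 existing nodes, so the decrease is rejected.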

      According to the code here:

      https://github.com/openshift/kubernetes-autoscaler/blob/7aea306f3cd9951007d5c1b981bf3da770b52790/cluster-autoscaler/core/static_autoscaler.go#L421

      once the autoscaler enters the failure branch of the above `if` statement, it no longer reaches the check for unschedulable pods in this if/else block:

      https://github.com/openshift/kubernetes-autoscaler/blob/7aea306f3cd9951007d5c1b981bf3da770b52790/cluster-autoscaler/core/static_autoscaler.go#L509

      It does not recover until the underlying error is fixed.
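The control flow described above can be sketched as follows. This is an illustrative model of the reported behavior only, not the code at the linked lines; `nodeGroup`, `runOnce`, and `errQuota` are hypothetical names introduced for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errQuota is a hypothetical error standing in for the customer's quota failure.
var errQuota = errors.New("quota exceeded")

// nodeGroup is a hypothetical stand-in for a MachineSet-backed node group.
type nodeGroup struct {
	name   string
	fixErr error // non-nil when fixing this group's size fails
}

// runOnce models one autoscaler iteration as described in this report.
// If fixing any node group fails, the iteration ends with an error and the
// unschedulable-pod check is never reached, even for healthy groups.
func runOnce(groups []nodeGroup) (checkedPods bool, err error) {
	for _, g := range groups {
		if g.fixErr != nil {
			return false, fmt.Errorf("failed to fix node group sizes: %s: %w", g.name, g.fixErr)
		}
	}
	// Only reached when every node group was fixed without error.
	return true, nil
}

func main() {
	groups := []nodeGroup{
		{name: "healthy-machineset"},
		{name: "failing-machineset", fixErr: errQuota},
	}
	checked, err := runOnce(groups)
	fmt.Println(checked, err)
}
```

In this model a single failing group prevents the pod check for every group, which matches the symptom reported here: scaling stops cluster-wide until the failing group is repaired.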

      This can be seen in the log below: the autoscaler stopped checking for unschedulable pods at 2023-03-22T12:46 and only resumed on March 23 at 19:14 UTC.

      TL;DR: once the quota issue (the specific error the customer was hitting) was resolved, some time between 19:05 (the last time a quota error was seen) and 19:14, the autoscaler recovered and resumed working.

      I0323 19:15:21.925996       1 static_autoscaler.go:419] No unschedulable pods
      I0323 19:15:48.151241       1 static_autoscaler.go:419] No unschedulable pods
      I0323 19:14:00.640484       1 klogx.go:86] Pod eu-3-compute/compute-customer-27-7b6b98989-vrl7g is unschedulable
      I0323 19:14:31.471670       1 static_autoscaler.go:419] No unschedulable pods
      I0323 19:14:56.698237       1 static_autoscaler.go:419] No unschedulable pods
      I0322 12:46:23.716449       1 static_autoscaler.go:419] No unschedulable pods << last entry on March 22nd
      I0322 12:45:08.041970       1 static_autoscaler.go:419] No unschedulable pods
      I0322 12:45:33.267542       1 static_autoscaler.go:419] No unschedulable pods
      I0322 12:45:58.492504       1 static_autoscaler.go:419] No unschedulable pods
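The gap in the entries above can also be located mechanically. The sketch below parses the klog timestamps, sorts them, and reports the largest interval between consecutive entries. `parseKlogTime` and `largestGap` are hypothetical helpers written for this example; because klog headers carry no year, the sketch assumes all lines fall within the same year.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// klogLayout matches the "MMDD hh:mm:ss.uuuuuu" part of a klog header.
// The header has no year, so parsed times are only comparable within one year.
const klogLayout = "0102 15:04:05.000000"

// parseKlogTime extracts the timestamp from a line such as
// "I0322 12:46:23.716449       1 static_autoscaler.go:419] ...".
func parseKlogTime(line string) (time.Time, error) {
	fields := strings.Fields(line)
	if len(fields) < 2 || len(fields[0]) < 2 {
		return time.Time{}, fmt.Errorf("not a klog line: %q", line)
	}
	// Drop the leading severity letter (I, W, E, F) before parsing.
	return time.Parse(klogLayout, fields[0][1:]+" "+fields[1])
}

// largestGap returns the longest interval between consecutive log entries
// after sorting them by time. The input order does not matter.
func largestGap(lines []string) (time.Duration, error) {
	times := make([]time.Time, 0, len(lines))
	for _, l := range lines {
		t, err := parseKlogTime(l)
		if err != nil {
			return 0, err
		}
		times = append(times, t)
	}
	sort.Slice(times, func(i, j int) bool { return times[i].Before(times[j]) })
	var gap time.Duration
	for i := 1; i < len(times); i++ {
		if d := times[i].Sub(times[i-1]); d > gap {
			gap = d
		}
	}
	return gap, nil
}

func main() {
	lines := []string{
		"I0323 19:14:00.640484       1 klogx.go:86] Pod eu-3-compute/compute-customer-27-7b6b98989-vrl7g is unschedulable",
		"I0322 12:46:23.716449       1 static_autoscaler.go:419] No unschedulable pods",
		"I0322 12:45:58.492504       1 static_autoscaler.go:419] No unschedulable pods",
	}
	gap, err := largestGap(lines)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("largest gap:", gap)
}
```

For the entries in this report, the largest gap falls between the last March 22 entry and the first March 23 entry, a little over 30 hours.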
      

      Actual results:

      The autoscaler stops working entirely once it encounters an error for a machine.

      Expected results:

      1. Entering an error state for one machineautoscaler is acceptable, but the autoscaler is expected to keep working as long as other machineautoscalers are healthy.

      2. Also, the error message is not helpful. In further testing after the error occurred, the autoscaler was not being triggered, and the message does not say that the error must be fixed before the autoscaler can run again.

              mimccune@redhat.com Michael McCune
              openshift-crt-jira-prow OpenShift Prow Bot
              Zhaohua Sun Zhaohua Sun