OpenShift Bugs / OCPBUGS-37263

Investigate Cluster Autoscaler: Failed to fix node group sizes: failed to decrease


    • None
    • CLOUD Sprint 257, CLOUD Sprint 258, CLOUD Sprint 259
    • 3
    • False

      TODO: (Placeholder for now)

      Description of problem:

       

      The cluster autoscaler (deployed by the CAO) can get into a failed state, logging the following error:

          2023-03-22T12:46:49.148733289Z E0322 12:46:49.148726       1 static_autoscaler.go:364] Failed to fix node group sizes: failed to decrease MachineSet/openshift-machine-api/eu-3-compute-kgzn2-aro-machineset-compute-xl-germanywestcentral-1: attempt to delete existing nodes targetSize:4 delta:-1 existingNodes: 6
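      The numbers in that message suggest why the decrease is refused: lowering targetSize 4 by delta -1 would give a target of 3, which is below the 6 nodes that already exist. The sketch below only reproduces that arithmetic. The function name `check_decrease` and its output strings are hypothetical, and the condition is inferred from the error text rather than copied from the autoscaler source.

```shell
# Hypothetical helper, not the autoscaler's actual code: it mirrors the
# condition implied by the error text (targetSize + delta < existingNodes).
check_decrease() {
  cd_target=$1; cd_delta=$2; cd_nodes=$3
  cd_new=$((cd_target + cd_delta))
  if [ "$cd_new" -lt "$cd_nodes" ]; then
    echo "refused: new target $cd_new < existing nodes $cd_nodes"
    return 1
  fi
  echo "ok: new target $cd_new"
}

# Values from the log line above: targetSize:4 delta:-1 existingNodes: 6
check_decrease 4 -1 6 || true
```

      With the logged values the helper reports a refusal; the same delta would be accepted once the node count is 3 or lower.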

      Version-Release number of selected component (if applicable):

      4.16    

      How reproducible:

          Yes

      Steps to Reproduce:

      1. Scale down the operators and patch a worker machine's status to Deleting
      oc scale deployment cluster-version-operator -n openshift-cluster-version --replicas=0
      oc scale deployment machine-api-operator -n openshift-machine-api --replicas=0
      oc scale deployment machine-api-controllers -n openshift-machine-api --replicas=0
      kubectl config view --raw -o json | jq '.clusters[0].cluster."certificate-authority-data"' -r | base64 --decode > ca.crt
      kubectl config view --raw -o json | jq '.users[0].user."client-certificate-data"' -r | base64 --decode > client.crt
      kubectl config view --raw -o json | jq '.users[0].user."client-key-data"' -r | base64 --decode > client.key
      export SERVER=$(kubectl config view --raw -o json | jq '.clusters[0].cluster.server' -r)
      export WORKER_MACHINE=zhsun-cas-r28fw-worker-us-east-2c-t576t
      curl -H "Content-Type: application/merge-patch+json" --cacert ./ca.crt --cert ./client.crt --key ./client.key "$SERVER/apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/$WORKER_MACHINE/status" -XPATCH -d '{"status":{"phase":"Deleting"}}'

      2. Add workload
      $ oc create -f ~/data/scaleup-32.yaml                               
      deployment.apps/scale-up created
      $ oc get machineset   
      NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
      zhsun-cas-r28fw-worker-us-east-2a   3         1         1       1           10h
      zhsun-cas-r28fw-worker-us-east-2b   3         1         1       1           10h
      zhsun-cas-r28fw-worker-us-east-2c   3         1         1       1           10h
      $ oc get machine                                                    
      NAME                                      PHASE      TYPE         REGION      ZONE         AGE
      zhsun-cas-r28fw-master-0                  Running    m6i.xlarge   us-east-2   us-east-2a   10h
      zhsun-cas-r28fw-master-1                  Running    m6i.xlarge   us-east-2   us-east-2b   10h
      zhsun-cas-r28fw-master-2                  Running    m6i.xlarge   us-east-2   us-east-2c   10h
      zhsun-cas-r28fw-worker-us-east-2a-5rvgv   Running    m6i.xlarge   us-east-2   us-east-2a   134m
      zhsun-cas-r28fw-worker-us-east-2b-zn7gf   Running    m6i.xlarge   us-east-2   us-east-2b   148m
      zhsun-cas-r28fw-worker-us-east-2c-t576t   Deleting   m6i.xlarge   us-east-2   us-east-2c   72m
      $ oc get machineautoscaler                                          
      NAME                 REF KIND     REF NAME                            MIN   MAX   AGE
      machineautoscaler    MachineSet   zhsun-cas-r28fw-worker-us-east-2a   1     3     7h13m
      machineautoscalerb   MachineSet   zhsun-cas-r28fw-worker-us-east-2b   1     3     7h12m
      machineautoscalerc   MachineSet   zhsun-cas-r28fw-worker-us-east-2c   1     3     7h12m
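      In the output above every worker MachineSet shows DESIRED 3 but CURRENT 1. A small filter can make that mismatch easier to spot when repeating the steps. This is a minimal sketch: the function name `pending_machinesets` is hypothetical, and it assumes the default column order of `oc get machineset` (NAME, DESIRED, CURRENT, ...) as shown above.

```shell
# Hypothetical helper: reads `oc get machineset` table output on stdin and
# prints each MachineSet whose DESIRED count differs from its CURRENT count.
pending_machinesets() {
  awk 'NR > 1 && $2 != $3 { printf "%s desired=%s current=%s\n", $1, $2, $3 }'
}

# Sample input copied from the output above.
pending_machinesets <<'EOF'
NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
zhsun-cas-r28fw-worker-us-east-2a   3         1         1       1           10h
zhsun-cas-r28fw-worker-us-east-2b   3         1         1       1           10h
zhsun-cas-r28fw-worker-us-east-2c   3         1         1       1           10h
EOF
```

      Against a live cluster the same filter could be fed from `oc get machineset -n openshift-machine-api | pending_machinesets`.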

      Actual results:

          

      Expected results:

          

      Additional info:

          

            Theo Barber-Bany (rh-ee-tbarberb)
            Zhaohua Sun