OpenShift Bugs / OCPBUGS-23383

Autoscaler cannot scale down the nodegroup that has Failed machine when maxNodeProvisionTime is reached



      Description of problem:

      The autoscaler cannot scale down a node group that contains Failed machines after maxNodeProvisionTime is reached.

      Version-Release number of selected component (if applicable):

      4.15.0-0.nightly-2023-11-16-173006
      This case passes on 4.14.
      It also passed when I previously tested on 4.15.0-0.nightly-2023-10-09-101435.

      How reproducible:

      Always

      Steps to Reproduce:

      1. Create a machineset with replicas=0 and an invalid instanceType
      liuhuali@Lius-MacBook-Pro huali-test % oc get machineset huliu-aws17a-h6mv8-worker-us-east-2a -oyaml>ms1.yaml 
      liuhuali@Lius-MacBook-Pro huali-test % vim ms1.yaml 
      liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml 
      machineset.machine.openshift.io/huliu-aws17a-h6mv8-worker-us-east-2aa created
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                         PHASE     TYPE         REGION      ZONE         AGE
      huliu-aws17a-h6mv8-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   119m
      huliu-aws17a-h6mv8-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   119m
      huliu-aws17a-h6mv8-master-2                  Running   m6i.xlarge   us-east-2   us-east-2c   119m
      huliu-aws17a-h6mv8-worker-us-east-2a-vwgfx   Running   m6i.xlarge   us-east-2   us-east-2a   115m
      huliu-aws17a-h6mv8-worker-us-east-2b-d88qr   Running   m6i.xlarge   us-east-2   us-east-2b   115m
      huliu-aws17a-h6mv8-worker-us-east-2c-xmnbg   Running   m6i.xlarge   us-east-2   us-east-2c   115m
      liuhuali@Lius-MacBook-Pro huali-test % oc get machineset
      NAME                                    DESIRED   CURRENT   READY   AVAILABLE   AGE
      huliu-aws17a-h6mv8-worker-us-east-2a    1         1         1       1           119m
      huliu-aws17a-h6mv8-worker-us-east-2aa   0         0                             16s
      huliu-aws17a-h6mv8-worker-us-east-2b    1         1         1       1           119m
      huliu-aws17a-h6mv8-worker-us-east-2c    1         1         1       1           119m 
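
The hand edit to ms1.yaml in step 1 is not shown in the transcript. As a rough sketch, these are the fields such an edit would usually touch when cloning a MachineSet. The file name `ms1-edits.yaml`, the namespace, and the placeholder value `invalid` are assumptions for illustration; every other field of the exported manifest (including the rest of providerSpec) is omitted, so this excerpt is not a complete manifest.

```shell
# Hypothetical excerpt: only the fields step 1 describes changing.
# It is NOT a complete MachineSet; the remaining fields come from the export.
cat > ms1-edits.yaml <<'EOF'
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: huliu-aws17a-h6mv8-worker-us-east-2aa   # new name, as shown by "oc create"
  namespace: openshift-machine-api              # assumed namespace
spec:
  replicas: 0                                   # start empty so only the autoscaler adds machines
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machineset: huliu-aws17a-h6mv8-worker-us-east-2aa
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-machineset: huliu-aws17a-h6mv8-worker-us-east-2aa
    spec:
      providerSpec:
        value:
          instanceType: invalid                 # placeholder; the report does not give the value used
EOF
```

The selector and template labels are renamed together with the MachineSet so the copy does not select machines that belong to the original set.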
      
      2. Create a clusterautoscaler, a machineautoscaler, and a workload
      
      liuhuali@Lius-MacBook-Pro huali-test % oc create -f clusterautoscale.yaml 
      clusterautoscaler.autoscaling.openshift.io/default created
      liuhuali@Lius-MacBook-Pro huali-test % oc create -f machineautoscaler.yaml 
      machineautoscaler.autoscaling.openshift.io/machineautoscaler-test2 created
      liuhuali@Lius-MacBook-Pro huali-test % oc get machineautoscaler
      NAME                      REF KIND     REF NAME                                MIN   MAX   AGE
      machineautoscaler-test2   MachineSet   huliu-aws17a-h6mv8-worker-us-east-2aa   0     2     11s
      liuhuali@Lius-MacBook-Pro huali-test % oc create -f workloadauto.yaml 
      job.batch/workload created 
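
The contents of clusterautoscale.yaml and machineautoscaler.yaml are not included in the report. A minimal sketch of what they might contain follows. The resource names and the 0/2 replica bounds come from the output above; the `maxNodeProvisionTime` value of `10m`, the `scaleDown` setting, and the namespace are assumptions, not the values used in the original test.

```shell
# Sketch of the ClusterAutoscaler. The object name "default" matches the output above.
cat > clusterautoscale.yaml <<'EOF'
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  maxNodeProvisionTime: 10m   # assumed value; the report does not state it
  scaleDown:
    enabled: true             # assumed
EOF

# Sketch of the MachineAutoscaler targeting the new machineset (MIN 0, MAX 2).
cat > machineautoscaler.yaml <<'EOF'
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: machineautoscaler-test2
  namespace: openshift-machine-api   # assumed namespace
spec:
  minReplicas: 0
  maxReplicas: 2
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: huliu-aws17a-h6mv8-worker-us-east-2aa
EOF
```

With a short `maxNodeProvisionTime`, the point at which the scale-down is expected arrives quickly, which makes the behavior easier to observe.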
      
      3. The machineset scales up, but the Failed machines are never scaled down
      liuhuali@Lius-MacBook-Pro huali-test % oc get machineset                 
      NAME                                    DESIRED   CURRENT   READY   AVAILABLE   AGE
      huliu-aws17a-h6mv8-worker-us-east-2a    1         1         1       1           122m
      huliu-aws17a-h6mv8-worker-us-east-2aa   2         2                             3m38s
      huliu-aws17a-h6mv8-worker-us-east-2b    1         1         1       1           122m
      huliu-aws17a-h6mv8-worker-us-east-2c    1         1         1       1           122m
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine   
      NAME                                          PHASE     TYPE         REGION      ZONE         AGE
      huliu-aws17a-h6mv8-master-0                   Running   m6i.xlarge   us-east-2   us-east-2a   122m
      huliu-aws17a-h6mv8-master-1                   Running   m6i.xlarge   us-east-2   us-east-2b   122m
      huliu-aws17a-h6mv8-master-2                   Running   m6i.xlarge   us-east-2   us-east-2c   122m
      huliu-aws17a-h6mv8-worker-us-east-2a-vwgfx    Running   m6i.xlarge   us-east-2   us-east-2a   118m
      huliu-aws17a-h6mv8-worker-us-east-2aa-kpds5   Failed                                          5s
      huliu-aws17a-h6mv8-worker-us-east-2aa-zt4bn   Failed                                          5s
      huliu-aws17a-h6mv8-worker-us-east-2b-d88qr    Running   m6i.xlarge   us-east-2   us-east-2b   118m
      huliu-aws17a-h6mv8-worker-us-east-2c-xmnbg    Running   m6i.xlarge   us-east-2   us-east-2c   118m
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                          PHASE     TYPE         REGION      ZONE         AGE
      huliu-aws17a-h6mv8-master-0                   Running   m6i.xlarge   us-east-2   us-east-2a   3h45m
      huliu-aws17a-h6mv8-master-1                   Running   m6i.xlarge   us-east-2   us-east-2b   3h45m
      huliu-aws17a-h6mv8-master-2                   Running   m6i.xlarge   us-east-2   us-east-2c   3h45m
      huliu-aws17a-h6mv8-worker-us-east-2a-vwgfx    Running   m6i.xlarge   us-east-2   us-east-2a   3h40m
      huliu-aws17a-h6mv8-worker-us-east-2aa-kpds5   Failed                                          102m
      huliu-aws17a-h6mv8-worker-us-east-2aa-zt4bn   Failed                                          102m
      huliu-aws17a-h6mv8-worker-us-east-2b-d88qr    Running   m6i.xlarge   us-east-2   us-east-2b   3h40m
      huliu-aws17a-h6mv8-worker-us-east-2c-xmnbg    Running   m6i.xlarge   us-east-2   us-east-2c   3h40m
      liuhuali@Lius-MacBook-Pro huali-test %  
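
To make the check in step 3 repeatable, the Failed machines can be picked out of the table output with a small filter. This is only a sketch: `failed_machines` is a hypothetical helper name, and it assumes the default column layout of `oc get machine` shown above (NAME first, PHASE second). The sample input is an abbreviated copy of the output in this report.

```shell
# Print the names of machines whose PHASE column is "Failed".
# Reads "oc get machine" table output on stdin and skips the header row.
failed_machines() {
  awk 'NR > 1 && $2 == "Failed" { print $1 }'
}

# Demo on rows copied from the report; prints the two "2aa" machine names.
failed_machines <<'EOF'
NAME                                          PHASE     TYPE         REGION      ZONE         AGE
huliu-aws17a-h6mv8-worker-us-east-2a-vwgfx    Running   m6i.xlarge   us-east-2   us-east-2a   3h40m
huliu-aws17a-h6mv8-worker-us-east-2aa-kpds5   Failed                                          102m
huliu-aws17a-h6mv8-worker-us-east-2aa-zt4bn   Failed                                          102m
huliu-aws17a-h6mv8-worker-us-east-2b-d88qr    Running   m6i.xlarge   us-east-2   us-east-2b   3h40m
EOF
```

Against a live cluster the same filter would be used as `oc get machine | failed_machines`. An empty result after maxNodeProvisionTime has passed would indicate the expected scale-down.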

      Actual results:

      The autoscaler does not scale down the node group that contains Failed machines after maxNodeProvisionTime is reached.

      Expected results:

      The autoscaler scales down the node group that contains Failed machines once maxNodeProvisionTime is reached.

      Additional info:

      The automated version of these steps is here: https://github.com/openshift/openshift-tests-private/blob/master/test/extended/clusterinfrastructure/autoscaler.go#L324-L389
      Found while running the CAO regression for https://issues.redhat.com/browse/OCPCLOUD-2137
      Must-gather: https://drive.google.com/file/d/1TQJArXVH6mbplNULzxSLJG8ue3GZ0LtO/view?usp=sharing

            rh-ee-nbrubake Nolan Brubaker
            huliu@redhat.com Huali Liu
            Zhaohua Sun Zhaohua Sun