OCPBUGS-7907

Cluster Autoscaler balancing similar nodes test fails randomly



    • Important
    • CLOUD Sprint 232, CLOUD Sprint 233, CLOUD Sprint 234
    • No
    • Rejected




      Description of problem:

      When running the cluster-api-actuator-pkg test suite, the autoscaler test "Autoscaler should use a ClusterAutoscaler that has balance similar nodes enabled and 100 maximum total nodes [It] places nodes evenly across node groups [Slow] [autoscaler]" fails intermittently on the Azure platform.
      This appears to be related to how the cluster autoscaler evaluates similarity between node groups. After instrumenting the autoscaler, I was able to see two categories of failures:
      1. Mismatch in memory values. This is a known issue with Azure, and although there is a 1.5% tolerance built in, the memory is occasionally outside that range.
      2. Mismatch in the resources available on the nodes. The cluster autoscaler uses nodes that already exist in the cluster to determine whether the node groups it is evaluating are similar. During this calculation it compares the Node.status.capacity resources; if the nodes do not have the same resource types, it considers them dissimilar. In some runs, the "hugepages-1Gi" and "hugepages-2Mi" resources appear to be added to a node after it has been running for a few minutes, which also throws off the autoscaler's calculations.
      Of these failures, #2 happens much more frequently.
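
      The two failure categories can be illustrated with a minimal Go sketch. This is not the cluster autoscaler's actual code: the capacity type, the compareCapacity function, the exact-match rule for non-memory resources, and the reason strings are all hypothetical simplifications. Only the 1.5% memory tolerance and the "same resource types" rule come from the description above.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// memoryToleranceRatio is the 1.5% tolerance mentioned in the report.
const memoryToleranceRatio = 0.015

// capacity is a hypothetical, simplified stand-in for Node.status.capacity:
// resource name -> quantity (bytes for memory, plain counts otherwise).
type capacity map[string]int64

// compareCapacity reports whether two nodes look similar and, if not, why.
// Assumption: non-memory resources must match exactly.
func compareCapacity(a, b capacity) (bool, string) {
	// Collect the union of resource names, sorted so the reason is stable.
	seen := make(map[string]struct{})
	for name := range a {
		seen[name] = struct{}{}
	}
	for name := range b {
		seen[name] = struct{}{}
	}
	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names)

	for _, name := range names {
		qa, inA := a[name]
		qb, inB := b[name]
		// Category 2: a resource type (e.g. hugepages-1Gi) exists on one node only.
		if inA != inB {
			return false, fmt.Sprintf("resource %q is present on only one node", name)
		}
		if name == "memory" {
			// Category 1: memory may differ, but only within the tolerance.
			larger := math.Max(float64(qa), float64(qb))
			if larger > 0 && math.Abs(float64(qa-qb))/larger > memoryToleranceRatio {
				return false, "memory differs by more than the tolerance"
			}
			continue
		}
		if qa != qb {
			return false, fmt.Sprintf("resource %q differs: %d vs %d", name, qa, qb)
		}
	}
	return true, ""
}
```

      Returning a reason alongside the boolean is a design choice for the sketch: it shows the kind of information that would make an intermittent "not similar" decision easier to diagnose.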

      Version-Release number of selected component (if applicable):

      Tested against master of openshift/cluster-api-actuator-pkg @ 6f6ddf733522bef5820f7345989b8d48a9a1f6cd

      How reproducible:

      The test has failed in about 50% of the runs I have observed.

      Steps to Reproduce:

      1. Run the cluster-api-actuator-pkg tests, focusing on "balance similar nodes enabled and 100 maximum total nodes".
      2. Wait for failures.

      Actual results:

      The test fails with very little information about why the autoscaler did not evaluate the node groups as similar.

      Expected results:

      The test passes.

      Additional info:

      This is difficult to debug without adding more logging to the cluster autoscaler.





              mimccune@redhat.com Michael McCune
              Zhaohua Sun Zhaohua Sun