OpenShift Bugs / OCPBUGS-10891

Cluster Autoscaler balancing similar nodes test fails randomly

Details

    • Important
    • No
    • Rejected
    • False
    • N/A
    • Bug Fix
    • Done

    Description

      This is a clone of issue OCPBUGS-7907. The following is the description of the original issue:

      Description of problem:

      When running the cluster-api-actuator-pkg test suite, the autoscaler test "Autoscaler should use a ClusterAutoscaler that has balance similar nodes enabled and 100 maximum total nodes [It] places nodes evenly across node groups [Slow] [autoscaler]" will fail intermittently on the Azure platform.
      
      This appears to be related to how the cluster autoscaler evaluates similarity between node groups. After instrumenting the autoscaler, I was able to see two categories of failures:
      
      1. Mismatch in memory values. This is a known issue with Azure, and although there is a 1.5% tolerance built in, the memory is occasionally outside that range.
      2. Mismatch in the resources available on the nodes. The cluster autoscaler uses nodes that exist in the cluster to determine whether the node groups it is evaluating are similar. During this calculation it compares the Node.status.capacity resources, and if the nodes do not have the same resource types, it considers them dissimilar. In some runs, the "hugepages-1Gi" and "hugepages-2Mi" resources appear to be added to a node after it has been running for a few minutes, which throws off the autoscaler's calculations.
      
      Of these failures, #2 happens much more frequently.
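
To make the two failure categories concrete, here is a minimal sketch of the kind of comparison described above. It is not the cluster autoscaler's code: the function names and sample capacities are hypothetical, and it assumes the 1.5% tolerance is measured relative to the larger of the two memory values.

```python
# Illustrative sketch only; not the cluster autoscaler's actual implementation.
# All names and sample values below are hypothetical.

MEMORY_TOLERANCE = 0.015  # the 1.5% tolerance mentioned in this report


def memory_within_tolerance(mem_a, mem_b, tolerance=MEMORY_TOLERANCE):
    """Return True if two memory sizes (in bytes) differ by at most `tolerance`.

    Assumption: the difference is measured relative to the larger value.
    """
    larger, smaller = max(mem_a, mem_b), min(mem_a, mem_b)
    return (larger - smaller) <= larger * tolerance


def same_resource_types(capacity_a, capacity_b):
    """Return True only if both capacity maps list exactly the same resource names."""
    return set(capacity_a) == set(capacity_b)


def nodes_look_similar(capacity_a, capacity_b):
    """Combine both checks: identical resource names, then memory within tolerance."""
    if not same_resource_types(capacity_a, capacity_b):
        return False
    return memory_within_tolerance(
        capacity_a.get("memory", 0), capacity_b.get("memory", 0)
    )


GIB = 1024 ** 3

# A node that has just joined the cluster (no hugepages entries yet).
fresh_node = {"cpu": 4, "memory": 16 * GIB, "pods": 250}

# A node that has been running for a few minutes and now reports hugepages.
older_node = {
    "cpu": 4,
    "memory": 16 * GIB,
    "pods": 250,
    "hugepages-1Gi": 0,
    "hugepages-2Mi": 0,
}

if __name__ == "__main__":
    # Category 2: the resource names differ, so the nodes are treated as dissimilar.
    print(nodes_look_similar(fresh_node, older_node))
    # Category 1: a 2% memory gap exceeds the 1.5% tolerance.
    print(memory_within_tolerance(16 * GIB, int(16 * GIB * 0.98)))
```

In this sketch the extra hugepages keys make the comparison fail even though their values are zero, which matches the behavior described in item 2.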

      Version-Release number of selected component (if applicable):

      Testing against master of openshift/cluster-api-actuator-pkg @ 6f6ddf733522bef5820f7345989b8d48a9a1f6cd

      How reproducible:

      The test has failed in about 50% of the runs I have observed.

      Steps to Reproduce:

      1. Run the cluster-api-actuator-pkg tests, focusing on "balance similar nodes enabled and 100 maximum total nodes".
      2. Wait for failures.
      

      Actual results:

      The test fails with very limited information about why the autoscaler did not evaluate the node groups as similar.

      Expected results:

      The test passes.

      Additional info:

      This is difficult to debug without adding more logging to the cluster autoscaler.
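
As a rough idea of what more helpful output could look like, the sketch below reports why two capacity maps would be judged dissimilar instead of returning only a yes/no answer. It is a hypothetical helper, not something that exists in the cluster autoscaler; the function name, message wording, and the relative-to-larger memory comparison are all assumptions.

```python
# Hypothetical diagnostic helper; not part of the cluster autoscaler.

def explain_dissimilarity(capacity_a, capacity_b, memory_tolerance=0.015):
    """Return a list of human-readable reasons the two capacity maps differ.

    An empty list means neither check in this sketch found a mismatch.
    """
    reasons = []

    # Resource names present on one node but missing from the other.
    only_a = sorted(set(capacity_a) - set(capacity_b))
    only_b = sorted(set(capacity_b) - set(capacity_a))
    if only_a:
        reasons.append("resources only on first node: " + ", ".join(only_a))
    if only_b:
        reasons.append("resources only on second node: " + ", ".join(only_b))

    # Memory difference, measured relative to the larger value (an assumption).
    mem_a = capacity_a.get("memory")
    mem_b = capacity_b.get("memory")
    if mem_a is not None and mem_b is not None:
        larger = max(mem_a, mem_b)
        if larger > 0:
            ratio = (larger - min(mem_a, mem_b)) / larger
            if ratio > memory_tolerance:
                reasons.append(
                    f"memory differs by {ratio:.2%}, "
                    f"above the {memory_tolerance:.2%} tolerance"
                )
    return reasons


if __name__ == "__main__":
    # Sample values are made up for illustration.
    first = {"cpu": 4, "memory": 1000}
    second = {"cpu": 4, "memory": 970, "hugepages-1Gi": 0, "hugepages-2Mi": 0}
    for reason in explain_dissimilarity(first, second):
        print(reason)
```

Logging messages of this kind at the point where node groups are compared would show directly whether a given run hit the memory mismatch or the hugepages mismatch.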

            People

              Michael McCune (mimccune@redhat.com)
              OpenShift Prow Bot (openshift-crt-jira-prow)
              Zhaohua Sun
              Jeana Routh