Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: 4.13
Component/s: Cloud Compute / Cluster Autoscaler
Labels:
None

Severity:
Important
Regression:
No
Sprint:
CLOUD Sprint 232, CLOUD Sprint 233, CLOUD Sprint 234
sprint_count:
3
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

None
Target Version:

4.14.0


Description of problem:

When running the cluster-api-actuator-pkg test suite, the autoscaler test "Autoscaler should use a ClusterAutoscaler that has balance similar nodes enabled and 100 maximum total nodes [It] places nodes evenly across node groups [Slow] [autoscaler]" fails intermittently on the Azure platform.

This appears to be related to how the cluster autoscaler evaluates similarity between node groups. After instrumenting the autoscaler, I was able to see two categories of failures:

1. Mismatch in memory values. This is a known issue with Azure, and although there is a 1.5% tolerance built in, the memory difference is occasionally outside that range.
2. Mismatch in the resources available on the nodes. The cluster autoscaler uses nodes that exist in the cluster to determine whether the node groups it is evaluating are similar. During this calculation it compares the Node.status.capacity resources; if the nodes do not have the same resource types, it considers them dissimilar. In some runs, it appears that the "hugepages-1Gi" and "hugepages-2Mi" resources are added to a node after it has been running for a few minutes, which also throws off the autoscaler's calculations.

Of these failures, #2 happens much more frequently.
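
The two failure categories can be illustrated with a small model of the comparison. This is a simplified sketch in Python, not the autoscaler's actual code: the function names, the sample capacity values, and the choice to measure the memory difference relative to the larger value are all assumptions made for illustration. Only the 1.5% memory tolerance and the requirement that both nodes expose the same resource types come from the description above.

```python
# Hypothetical model of the node group similarity check described above.
MEMORY_TOLERANCE = 0.015  # the 1.5% memory tolerance mentioned in this report


def within_tolerance(a, b, ratio):
    """Return True if a and b differ by at most `ratio` of the larger value.

    Measuring against the larger value is an assumption of this sketch.
    """
    larger, smaller = max(a, b), min(a, b)
    return larger - smaller <= larger * ratio


def capacities_similar(cap_a, cap_b, memory_tolerance=MEMORY_TOLERANCE):
    """Compare two Node.status.capacity-style maps (resource name -> quantity)."""
    # Failure category 2: the nodes do not expose the same resource types.
    if set(cap_a) != set(cap_b):
        return False
    for name, qty_a in cap_a.items():
        qty_b = cap_b[name]
        if name == "memory":
            # Failure category 1: memory differs by more than the tolerance.
            if not within_tolerance(qty_a, qty_b, memory_tolerance):
                return False
        elif qty_a != qty_b:
            # In this sketch, every other resource must match exactly.
            return False
    return True


GIB = 1024 ** 3
MIB = 1024 ** 2

# Illustrative capacities only; these are not values taken from a real cluster.
node_a = {"cpu": 4, "memory": 16 * GIB, "pods": 250}
node_small_drift = {"cpu": 4, "memory": 16 * GIB - 100 * MIB, "pods": 250}
node_large_drift = {"cpu": 4, "memory": int(16 * GIB * 0.98), "pods": 250}
node_with_hugepages = dict(node_a, **{"hugepages-1Gi": 0, "hugepages-2Mi": 0})

print(capacities_similar(node_a, node_small_drift))
print(capacities_similar(node_a, node_large_drift))
print(capacities_similar(node_a, node_with_hugepages))
```

In this model, a memory difference of about 0.6% is accepted and a 2% difference is rejected. A node that has gained the hugepages resources is rejected even though every shared resource matches, which corresponds to the more frequent failure #2.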

Version-Release number of selected component (if applicable):

Testing against master of openshift/cluster-api-actuator-pkg @ 6f6ddf733522bef5820f7345989b8d48a9a1f6cd

How reproducible:

The test has failed in about 50% of the runs I have observed.

Steps to Reproduce:

1. Run the cluster-api-actuator-pkg tests, focusing on "balance similar nodes enabled and 100 maximum total nodes".
2. Wait for failures.

Actual results:

The test fails with very limited information about why the autoscaler did not evaluate the node groups as similar.

Expected results:

The test passes.

Additional info:

This is difficult to debug without adding more logging to the cluster autoscaler.

blocks

OCPBUGS-10891 Cluster Autoscaler balancing similar nodes test fails randomly

Closed

is cloned by

OCPBUGS-10891 Cluster Autoscaler balancing similar nodes test fails randomly

Closed

links to

openshift/cluster-api-actuator-pkg#272: OCPBUGS-7907: improve balance similar nodes tests

Assignee: Michael McCune

Reporter: Michael McCune

QA Contact: Zhaohua Sun

Votes: 0

Watchers: 3

Created: 2023/02/22 8:14 PM

Updated: 2023/06/05 10:53 AM

Resolved: 2023/06/05 10:53 AM
