Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 4.19
Component/s: Cluster Autoscaler
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
3
Severity:
Important
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
AUTOSCALE - Sprint 279, AUTOSCALE - Sprint 280, AUTOSCALE - Sprint 281, AUTOSCALE - Sprint 282, AUTOSCALE - Sprint 283, AUTOSCALE - Sprint 287
sprint_count:
6

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

  When a cluster has multiple node groups managed by Cluster Autoscaler and one node group fails to find a template node, all other node groups in the cluster also stop scaling up until the problematic node group returns to normal. This occurs even when the pending pods could be scheduled on the other healthy node groups.

  Examples of conditions that leave a node group without a template node:
  - All nodes being tainted as unschedulable (related to OCPBUGS-57131)
  - No nodes in Ready status

Version-Release number of selected component (if applicable):

- Observed in: 4.19.15 (HCP cluster)
- Likely affects: All versions with Cluster Autoscaler

How reproducible:

Consistently reproducible when one node group enters a state in which its template node cannot be determined.

Steps to Reproduce:

  1. Create a HyperShift cluster with multiple node pools (e.g., 3 node pools: m7i-2xlarge, m7a-8xlarge, rosa-core-0)
  2. Cause one node pool to lose its template node (e.g., delete and recreate the node pool so that its nodes are temporarily not ready, or taint all of its nodes)
  3. Deploy pods whose resource requests should trigger autoscaling
  4. Observe that the pods remain pending even though other healthy node groups could accommodate them

Actual results:

E1015 11:54:15.119968       1 static_autoscaler.go:518] Failed to scale up: could not get upcoming nodes: failed to find template node for node group MachineDeployment/ocm-production-2jb1l855j59bj002nqjmnht6p8237uc5-rosaint-use1-t/rosaint-use1-t-r7-2xlarge-0

  - The error occurs even when the pending pods have tolerations and could be scheduled on other node groups
  - All node groups stop scaling, not just the problematic one
  - Multiple regular pods remain in Pending state across the cluster
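
For triage, it can help to pull the blocking node group out of the autoscaler log. The following is a minimal sketch, assuming the error text matches the line quoted above; the function name `failing_node_group` is hypothetical and not part of Cluster Autoscaler.

```python
import re
from typing import Optional

# Matches the message fragment seen in the reported log line.
# Assumption: the node group identifier contains no whitespace.
_TEMPLATE_ERROR = re.compile(
    r"failed to find template node for node group (\S+)"
)


def failing_node_group(log_line: str) -> Optional[str]:
    """Return the node group named in a template-node error, or None."""
    match = _TEMPLATE_ERROR.search(log_line)
    return match.group(1) if match else None
```

Applied to the log line in this report, the function returns the full `MachineDeployment/...` identifier of the node group that blocks scale-up.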

Expected results:

  - Cluster Autoscaler should skip the problematic node group and continue evaluating other healthy node groups for scale-up
  - Only pods specifically requiring the problematic node group (via nodeSelector, affinity, or unique tolerations) should remain pending
  - Pods that can be scheduled on healthy node groups should trigger scale-up on those groups
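
The expected behavior can be illustrated with a small model. This is a sketch under stated assumptions, not the Cluster Autoscaler implementation (which is written in Go): `NodeGroup`, `Pod`, `has_template`, `required_group`, and `plan_scale_up` are hypothetical names, and pod placement is reduced to an optional "must run on this group" constraint standing in for nodeSelector, affinity, or unique tolerations.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


class TemplateNodeError(Exception):
    """Raised when no template node can be found for a node group."""


@dataclass
class NodeGroup:
    name: str
    has_template: bool  # hypothetical stand-in for "a template node exists"


@dataclass
class Pod:
    name: str
    required_group: Optional[str] = None  # None means any group is acceptable


def template_for(group: NodeGroup) -> NodeGroup:
    if not group.has_template:
        raise TemplateNodeError(
            f"failed to find template node for node group {group.name}"
        )
    return group


def plan_scale_up(
    pods: List[Pod], groups: List[NodeGroup]
) -> Tuple[Dict[str, str], List[str], List[str]]:
    """Return (pod -> group plan, skipped groups, still-pending pods)."""
    usable: List[NodeGroup] = []
    skipped: List[str] = []
    for group in groups:
        try:
            usable.append(template_for(group))
        except TemplateNodeError:
            # Skip only the failing group instead of aborting the whole pass.
            skipped.append(group.name)

    plan: Dict[str, str] = {}
    pending: List[str] = []
    for pod in pods:
        target = next(
            (g for g in usable
             if pod.required_group in (None, g.name)),
            None,
        )
        if target is None:
            pending.append(pod.name)
        else:
            plan[pod.name] = target.name
    return plan, skipped, pending
```

With the node pools from the reproduction steps and `rosa-core-0` marked as lacking a template, an unconstrained pod is planned onto a healthy group, while a pod pinned to `rosa-core-0` is the only one left pending.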

Additional info:

Assignee: Michael McCune

Reporter: Jude Zhu

QA Contact: Paul Rozehnal

Need Info From: None

Votes: 0

Watchers: 6

Created: 2025/10/15 9:24 PM

Updated: 2026/02/24 8:58 PM
