[OCPBUGS-33592] Automatic scaling not always working because NodeGroup.GetOptions() not being implemented - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: None
Affects Version/s: 4.15
Component/s: Cloud Compute / Cluster Autoscaler
Labels:
- autoscalar
- bug

Test Coverage:

Severity:
Important
Regression:
No
Sprint:
CLOUD Sprint 253, CLOUD Sprint 254
sprint_count:
2
Release Blocker:
Proposed
Blocked:
False
Blocked Reason:
None
Release Note Text:
* Previously, an optional internal function of the cluster autoscaler caused repeated log entries when it was not implemented. The issue is resolved in this release. (link:https://issues.redhat.com/browse/OCPBUGS-33592[*OCPBUGS-33592*])
Release Note Type:
Bug Fix
Release Note Status:
Done
Target Version:

4.17.0
Target Backport Versions:

4.15
Escape Reason:
Escape Impact:
SDLC stage when should've been found:

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:
PX Priority Data:

Description of problem:

While investigating a node scaling problem on OpenShift Container Platform 4, I found the following messages reported in my OpenShift Container Platform 4 cluster.

E0513 11:15:09.331353       1 orchestrator.go:450] Couldn't get autoscaling options for ng: MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c
E0513 11:15:09.331365       1 orchestrator.go:450] Couldn't get autoscaling options for ng: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.331529       1 orchestrator.go:546] Pod project-100/curl-67f84bd857-h92wb can't be scheduled on MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0513 11:15:09.331684       1 orchestrator.go:157] No pod can fit to MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c
E0513 11:15:09.332076       1 orchestrator.go:507] Failed to get autoscaling options for node group MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c: Not implemented
I0513 11:15:09.332100       1 orchestrator.go:185] Best option to resize: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332110       1 orchestrator.go:189] Estimated 1 nodes needed in MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332135       1 orchestrator.go:295] Final scale-up plan: [{MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c 0->1 (max: 12)}]

The same events are reported in must-gather data reviewed from customers. Upstream, https://github.com/kubernetes/autoscaler/issues/6037 and https://github.com/kubernetes/autoscaler/issues/6676 appear to be solved via https://github.com/kubernetes/autoscaler/pull/6677 and https://github.com/kubernetes/autoscaler/pull/6038. I'm wondering whether we should pull in those changes, as the problem seems to eventually impact automated node scaling in OpenShift Container Platform 4.
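
To illustrate the failure mode the log lines point at: the autoscaler asks each node group for its autoscaling options, and a provider that does not implement the optional NodeGroup.GetOptions() answers with a "Not implemented" error. The sketch below shows one way a caller could treat that specific error as "use the cluster-wide defaults" rather than as a failure. It is a minimal, self-contained illustration and not the code from the linked pull requests; errNotImplemented, options, nodeGroup, machineSetGroup, and optionsOrDefaults are hypothetical names made up for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotImplemented is a hypothetical sentinel standing in for the
// "Not implemented" error text seen in the logs above.
var errNotImplemented = errors.New("Not implemented")

// options is a hypothetical, reduced set of per-node-group settings.
type options struct {
	ScaleDownUtilizationThreshold float64
}

// nodeGroup is a hypothetical, reduced interface used only in this sketch.
type nodeGroup interface {
	ID() string
	GetOptions(defaults options) (*options, error)
}

// machineSetGroup models a provider that does not implement GetOptions.
type machineSetGroup struct{ id string }

func (g machineSetGroup) ID() string { return g.id }

func (g machineSetGroup) GetOptions(options) (*options, error) {
	return nil, errNotImplemented
}

// optionsOrDefaults returns the node group's own options. It falls back to
// the defaults when the optional method is not implemented (or returns nil)
// and passes any other error on to the caller.
func optionsOrDefaults(ng nodeGroup, defaults options) (options, error) {
	opts, err := ng.GetOptions(defaults)
	if errors.Is(err, errNotImplemented) {
		return defaults, nil
	}
	if err != nil {
		return options{}, fmt.Errorf("getting options for %s: %w", ng.ID(), err)
	}
	if opts == nil {
		return defaults, nil
	}
	return *opts, nil
}

func main() {
	defaults := options{ScaleDownUtilizationThreshold: 0.5}
	ng := machineSetGroup{id: "MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c"}

	opts, err := optionsOrDefaults(ng, defaults)
	fmt.Println(opts.ScaleDownUtilizationThreshold, err)
}
```

The point of the design is that only the sentinel error is swallowed; any other error from the provider still surfaces, so genuine failures are not hidden.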

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.15

How reproducible:

Always

Steps to Reproduce:

1. Set up OpenShift Container Platform 4 with ClusterAutoscaler configured
2. Trigger scaling activity and check the cluster-autoscaler-default logs
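
The log check in step 2 can be scripted. The sketch below scans autoscaler log text for the two error messages quoted in this report and counts them. It assumes the log output has already been captured into a string; the names optionErrorMarkers and countOptionErrors are hypothetical, and the only assumption made about the log format is that error-severity lines begin with "E", as in the excerpts in this report.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// optionErrorMarkers are the two message fragments quoted in this report.
var optionErrorMarkers = []string{
	"Couldn't get autoscaling options for ng:",
	"Failed to get autoscaling options for node group",
}

// countOptionErrors counts error-severity lines (those starting with "E")
// that contain one of the markers above. Each line is counted at most once.
func countOptionErrors(logText string) int {
	count := 0
	scanner := bufio.NewScanner(strings.NewReader(logText))
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "E") {
			continue
		}
		for _, marker := range optionErrorMarkers {
			if strings.Contains(line, marker) {
				count++
				break
			}
		}
	}
	return count
}

func main() {
	// Sample lines copied from the log excerpt in this report.
	sample := strings.Join([]string{
		"E0513 11:15:09.331353       1 orchestrator.go:450] Couldn't get autoscaling options for ng: MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c",
		"E0513 11:15:09.332076       1 orchestrator.go:507] Failed to get autoscaling options for node group MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c: Not implemented",
		"I0513 11:15:09.332100       1 orchestrator.go:185] Best option to resize: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c",
	}, "\n")

	fmt.Println("matching error lines:", countOptionErrors(sample))
}
```

A count of zero after a scale-up would correspond to the expected results described in this report.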

Actual results:

Log entries like the following are reported:

E0513 11:15:09.331353       1 orchestrator.go:450] Couldn't get autoscaling options for ng: MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c
E0513 11:15:09.331365       1 orchestrator.go:450] Couldn't get autoscaling options for ng: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.331529       1 orchestrator.go:546] Pod project-100/curl-67f84bd857-h92wb can't be scheduled on MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0513 11:15:09.331684       1 orchestrator.go:157] No pod can fit to MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c
E0513 11:15:09.332076       1 orchestrator.go:507] Failed to get autoscaling options for node group MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c: Not implemented
I0513 11:15:09.332100       1 orchestrator.go:185] Best option to resize: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332110       1 orchestrator.go:189] Estimated 1 nodes needed in MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332135       1 orchestrator.go:295] Final scale-up plan: [{MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c 0->1 (max: 12)}]

Expected results:

Scale-up of OpenShift Container Platform 4 nodes happens without errors being reported:

I0513 11:15:09.331529       1 orchestrator.go:546] Pod project-100/curl-67f84bd857-h92wb can't be scheduled on MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0513 11:15:09.331684       1 orchestrator.go:157] No pod can fit to MachineSet/openshift-machine-api/test-12345-batch-amd64-us-east-2c
I0513 11:15:09.332100       1 orchestrator.go:185] Best option to resize: MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332110       1 orchestrator.go:189] Estimated 1 nodes needed in MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c
I0513 11:15:09.332135       1 orchestrator.go:295] Final scale-up plan: [{MachineSet/openshift-machine-api/test-12345-batch-arm64-us-east-2c 0->1 (max: 12)}]

Additional info:

Please review https://github.com/kubernetes/autoscaler/issues/6037 and https://github.com/kubernetes/autoscaler/issues/6676, as they seem to document the problem and also have a solution linked and merged.

blocks

OCPBUGS-33932 Automatic scaling not always working because NodeGroup.GetOptions() not being implemented

Closed

is cloned by

OCPBUGS-33885 Automatic scaling not always working because NodeGroup.GetOptions() not being implemented

Closed

OCPBUGS-33932 Automatic scaling not always working because NodeGroup.GetOptions() not being implemented

Closed

links to

openshift/kubernetes-autoscaler#300: OCPBUGS-33592: fix: scale up broken for providers not implementing NodeGroup.GetOptions()

RHEA-2024:3718 OpenShift Container Platform 4.17.z bug fix update

Assignee: Michael McCune

Reporter: Simon Reber

QA Contact: Zhaohua Sun

Votes: 0

Watchers: 6

Created: 2024/05/13 12:22 PM

Updated: 2024/10/01 5:32 PM

Resolved: 2024/10/01 5:32 PM
