Type: Bug
Resolution: Done-Errata
Priority: Critical
Fix Version/s: 4.15.0
Affects Version/s: 4.12
Component/s: Cluster Autoscaler
Labels:
- bug
- cluster-autoscaler

Activity Type: Quality / Stability / Reliability
Blocked: False
Blocked Reason: None
Story Points: None
Severity: Important
Regression: No
Target Backport Versions: 4.13.z, 4.12.z, 4.14.z, 4.15.z
Target Version: 4.15.0
Release Blocker: None
Sprint: CLOUD Sprint 245
sprint_count: 1
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:
PX Priority Data:
PX Impact Score:
Release Note Status: Done
Release Note Type: Bug Fix
Release Note Text:

* Previously, the cluster autoscaler crashed when used with nodes that have Container Storage Interface (CSI) storage. The issue is resolved in this release. (link:https://issues.redhat.com/browse/OCPBUGS-23096[*OCPBUGS-23096*])

Escape Reason: None
Escape Impact: None
Corrective Measures: None
SDLC stage when should've been found: None

Description of problem:

The cluster-autoscaler-default pod is stuck in the CrashLoopBackOff state and reports the panic below.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x1650a64]

goroutine 91 [running]:
k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits.(*CSILimits).checkAttachableInlineVolume(_, {{0xc00065be20, 0x10}, {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...}}, ...)
    /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go:235 +0x6c4
k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits.(*CSILimits).filterAttachableVolumes(0xc00035a6c0, 0xc000672328, 0x4?, 0x1, 0xc000736570?)
    /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go:175 +0x625
k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits.(*CSILimits).Filter(0xc00035a6c0, {0x2fbf7e0?, 0x2f8e1a0?}, 0x7f62e3fe90c8?, 0xc000672328, 0xc0260c1200)
    /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go:103 +0x2f9
k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).runFilterPlugin(0x0?, {0x1f401c8?, 0xc00012a008?}, {0x7f6338340178?, 0xc00035a6c0?}, 0x0?, 0x0?, 0xc0004986e0?)
    /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/runtime/framework.go:736 +0x2bd
k8s.io/kubernetes/pkg/scheduler/framework/runtime.(*frameworkImpl).RunFilterPlugins(0xc00032e380, {0x1f401c8, 0xc00012a008}, 0x49?, 0x0?, 0x0?)
    /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/runtime/framework.go:718 +0xfa
k8s.io/autoscaler/cluster-autoscaler/simulator.(*SchedulerBasedPredicateChecker).CheckPredicates(0xc00056f880, {0x1f4b300, 0xc000014628}, 0x49?, {0xc01e0c19d0, 0x6f})
    /go/src/k8s.io/autoscaler/cluster-autoscaler/simulator/scheduler_based_predicates_checker.go:168 +0x20d
k8s.io/autoscaler/cluster-autoscaler/core.computeExpansionOption(0xc005c50380, {0xc019e27f80, 0xb, 0x1d?}, {0x1f4bea8?, 0xc025754e00?}, 0xc01680e240, {0xc018fd2c88, 0x0, 0x0})
    /go/src/k8s.io/autoscaler/cluster-autoscaler/core/scale_up.go:293 +0x616
k8s.io/autoscaler/cluster-autoscaler/core.ScaleUp(0xc005c50380, 0xc000324d20, 0x2f8e1a0?, {0xc01e293900, 0x15, 0x20}, {0xc015f3c400, 0x52, 0x80}, {0xc015a39e00, ...}, ...)
    /go/src/k8s.io/autoscaler/cluster-autoscaler/core/scale_up.go:446 +0x4676
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc000001900, {0x4?, 0x3235343233363836?, 0x2f8e1a0?})
    /go/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:461 +0x1ff5
main.run(0x34363a2273657479?, {0x1f46838, 0xc000443a40})
    /go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:421 +0x2cd
main.main.func2({0x65722d7466696873?, 0x65642d657361656c?})
    /go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:508 +0x25
created by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
    /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:211 +0x11b

Further investigation suggests that this is a known issue that was fixed upstream in https://github.com/kubernetes/kubernetes/pull/115179.

Because a cluster-autoscaler-default pod stuck in the CrashLoopBackOff state impairs Machine scaling, this is a request to bring the fix from https://github.com/kubernetes/kubernetes/pull/115179 to OpenShift Container Platform 4.13 and 4.12 to prevent critical incidents.

The respective commit for OpenShift Container Platform 4.12 is available in https://github.com/kubernetes/kubernetes/pull/115348.
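
For readers unfamiliar with this class of failure, the following is a minimal sketch of how reading a field through a nil pointer (for example, while building a log message) panics in Go, and how a nil check avoids it. The `csiNode` type and the `unsafeLogValue`, `safeLogValue`, and `panics` helpers are hypothetical names invented for this illustration. They are not the upstream Kubernetes code, and the sketch does not claim to reproduce the exact statement changed by the linked pull requests.

```go
package main

import "fmt"

// csiNode is a hypothetical stand-in for a CSINode API object; only the
// Name field matters for this illustration.
type csiNode struct {
	Name string
}

// unsafeLogValue reads a field through a pointer that may be nil.
// With a nil argument this panics with a nil pointer dereference.
func unsafeLogValue(n *csiNode) string {
	return n.Name
}

// safeLogValue checks for nil before reading the field, so a missing
// object yields a placeholder instead of a panic.
func safeLogValue(n *csiNode) string {
	if n == nil {
		return "<nil>"
	}
	return n.Name
}

// panics reports whether calling f triggers a panic.
func panics(f func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	var missing *csiNode // models an object that has not been posted yet

	fmt.Println(panics(func() { _ = unsafeLogValue(missing) }))
	fmt.Println(safeLogValue(missing))
	fmt.Println(safeLogValue(&csiNode{Name: "worker-0"}))
}
```

In this sketch the unchecked accessor panics for the missing object, while the checked accessor returns a placeholder, so the caller can keep running when the object is absent.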

Version-Release number of selected component (if applicable):

OpenShift Container Platform 4.12.34

How reproducible:

Random

Steps to Reproduce:

1. This appears to be a race condition, and it is not clear how best to trigger it.

Actual results:

The cluster-autoscaler-default pod is stuck in the CrashLoopBackOff state with the same panic and stack trace shown under "Description of problem", and fails to trigger Machine scaling.

Expected results:

The cluster-autoscaler-default pod should not panic and enter the CrashLoopBackOff state when the kubelet has not yet posted the CSINode object for the respective OpenShift Container Platform 4 node.

Additional info:

blocks

OCPBUGS-23270 nil pointer error in nodevolumelimits csi logging [4.14]

Closed

is cloned by

OCPBUGS-23270 nil pointer error in nodevolumelimits csi logging [4.14]

Closed

links to

Fix nil pointer error in nodevolumelimits csi logging

RHEA-2023:7198 rpm

Scaling of Machine is failing in OpenShift Container Platform 4 because cluster-autoscaler-default pod is in CrashLoopBackOff state

Assignee: Michael McCune
Reporter: Simon Reber
Need Info From: None
Contributors: None
QA Contact: Zhaohua Sun
Doc Contact: None
Votes: 0
Watchers: 5
Created: 2023/11/09 7:59 AM
Updated: 2025/09/12 9:13 PM
Resolved: 2024/02/27 9:02 PM
