Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.17.z
Affects Version/s: 4.16.0, 4.17
Component/s: Node / Kubelet
Labels:

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
Yes

Target Backport Versions:

4.17.z, 4.16.z
Target Version:

4.16.z
Release Blocker:
Rejected
Sprint:
OCP Node Sprint 273 (Green)
sprint_count:
1

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
Done
Release Note Type:
Bug Fix
Release Note Text:
Fixes a bug where the kubelet stopped reporting metrics when a stat call stalled in the kernel (for example, when the disk being stat'd is backed by NFS). The kubelet now reports metrics even if one disk is stuck.

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

Component Readiness reveals a potential regression with the following test:

[sig-node][invariant] alert/TargetDown should not be at or above info in ns/kube-system

Currently the test details link is showing 3 recent failures similar to the following:

Jun 12 07:48:09.154 - 58s W namespace/kube-system alert/TargetDown alertstate/firing severity/warning ALERTS{alertname="TargetDown", alertstate="firing", job="kubelet", namespace="kube-system", prometheus="openshift-monitoring/k8s", service="kubelet", severity="warning"}

This test ran 139 times in the 4 weeks before 4.15 GA without a single failure. It has failed 3 of its 31 runs in the last week, with all of the failures occurring since yesterday.
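To put the numbers above in perspective, here is a rough sketch of the arithmetic in Go. It assumes the baseline and recent runs are comparable and independent, which this issue does not establish. The helper names `failureRate` and `probAllInRecent` are made up for illustration and are not part of Component Readiness.

```go
package main

import "fmt"

// failureRate returns failures/runs as a fraction, or 0 when there are no runs.
func failureRate(failures, runs int) float64 {
	if runs == 0 {
		return 0
	}
	return float64(failures) / float64(runs)
}

// probAllInRecent returns the probability that all k failures land in the
// most recent `recent` runs out of `total` runs, if failures were spread
// uniformly at random across every run: C(recent, k) / C(total, k).
func probAllInRecent(k, recent, total int) float64 {
	p := 1.0
	for i := 0; i < k; i++ {
		p *= float64(recent-i) / float64(total-i)
	}
	return p
}

func main() {
	baseline := failureRate(0, 139) // 4 weeks before 4.15 GA
	recent := failureRate(3, 31)    // last week
	fmt.Printf("baseline failure rate: %.1f%%\n", baseline*100)
	fmt.Printf("recent failure rate:   %.1f%%\n", recent*100)

	// Chance of seeing all 3 failures in the recent window if nothing changed.
	fmt.Printf("probability under no change: %.4f\n", probAllInRecent(3, 31, 139+31))
}
```

Under these assumptions the recent failure rate is about 9.7%, and the chance of all three failures landing in the recent 31 runs by luck alone is roughly 0.6%, which is why the test is flagged as a potential regression.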

Relevant slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1718200183009099

It seems that the kubelet /metrics and /metrics/cadvisor endpoints went down and later recovered.

clones

OCPBUGS-57289 [4.17] Kubelet metrics endpoints experiencing prolonged outages

Closed

depends on

OCPBUGS-57289 [4.17] Kubelet metrics endpoints experiencing prolonged outages

Closed

links to

openshift/kubernetes#2325: OCPBUGS-57290: UPSTREAM: <carry>: Bump cadvisor version to fix kubelet

RHSA-2025:9765 OpenShift Container Platform 4.16.43 bug fix and security update

Assignee: Maysa De Macedo Souza

Reporter: Ken Zhang

QA Contact: Min Li

Need Info From: None

Votes: 1

Watchers: 7

Created: 2025/06/10 3:17 PM

Updated: 2025/10/20 10:36 AM

Resolved: 2025/07/02 3:53 AM
