-
Bug
-
Resolution: Done-Errata
-
Major
-
4.16.0, 4.17
-
Quality / Stability / Reliability
-
Done
-
Bug Fix
-
Fixed a bug where the kubelet stopped reporting metrics when a stat call stalled in the kernel (for example, when the disk being stat'd resides on NFS). The kubelet now continues to report metrics even if one disk is stuck.
Description of problem:
Component Readiness reveals a potential regression with the following test:
[sig-node][invariant] alert/TargetDown should not be at or above info in ns/kube-system
Currently the test details link is showing 3 recent failures similar to the following:
Jun 12 07:48:09.154 - 58s W namespace/kube-system alert/TargetDown alertstate/firing severity/warning ALERTS{alertname="TargetDown", alertstate="firing", job="kubelet", namespace="kube-system", prometheus="openshift-monitoring/k8s", service="kubelet", severity="warning"}
This test ran 139 times in the 4 weeks before 4.15 GA and never failed. It has failed 3 out of 31 times in the last week, with all of the failures occurring since yesterday.
Relevant slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1718200183009099
It appears that the kubelet's /metrics and /metrics/cadvisor endpoints went down and later recovered.
- clones
-
OCPBUGS-57289 [4.17] Kubelet metrics endpoints experiencing prolonged outages
-
- Closed
-
- depends on
-
OCPBUGS-57289 [4.17] Kubelet metrics endpoints experiencing prolonged outages
-
- Closed
-
- links to
-
RHSA-2025:9765 OpenShift Container Platform 4.16.43 bug fix and security update