Type: Bug
Resolution: Unresolved
Priority: Major
None
4.16.0, 4.17
Description of problem:
Component Readiness reveals a potential regression with the following test:
[sig-node][invariant] alert/TargetDown should not be at or above info in ns/kube-system
Currently the test details link is showing 3 recent failures similar to the following:
Jun 12 07:48:09.154 - 58s W namespace/kube-system alert/TargetDown alertstate/firing severity/warning ALERTS{alertname="TargetDown", alertstate="firing", job="kubelet", namespace="kube-system", prometheus="openshift-monitoring/k8s", service="kubelet", severity="warning"}
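To see at a glance which target the alert refers to, the label set from the failure above can be split into key/value pairs. This is only a minimal sketch: `parse_labels` is a hypothetical helper, not part of any OpenShift or Prometheus tooling, and the label string is copied from the failure line rather than read from real test output.

```python
import re

# Label set copied verbatim from the failure above (assumption: in practice
# this would be read from the test's interval output, not hard-coded).
LABELS = (
    '{alertname="TargetDown", alertstate="firing", job="kubelet", '
    'namespace="kube-system", prometheus="openshift-monitoring/k8s", '
    'service="kubelet", severity="warning"}'
)


def parse_labels(text: str) -> dict:
    """Parse a Prometheus-style label set into a dict (hypothetical helper)."""
    # Each label looks like name="value"; values here contain no escaped quotes.
    return dict(re.findall(r'(\w+)="([^"]*)"', text))


labels = parse_labels(LABELS)
print(labels["job"], labels["service"], labels["severity"])
```

The `job` and `service` labels both point at the kubelet, so the kubelet scrape targets are the ones reported as down.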
This test ran 139 times in the 4 weeks before 4.15 GA and never failed. It has failed 3 of 31 runs in the last week, with all of the failures occurring since yesterday.
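As a rough sanity check on whether 3 failures in 31 runs is notable against 0 failures in 139 runs, a one-sided Fisher's exact test can be computed by hand. This is a sketch under assumptions: the report does not say how Component Readiness evaluates regressions or what threshold it applies, and `fisher_one_sided` is a hypothetical helper written for this example.

```python
from math import comb


def fisher_one_sided(base_fail, base_total, sample_fail, sample_total):
    """One-sided Fisher's exact p-value (hypothetical helper).

    Probability that at least `sample_fail` of the combined failures land in
    the sample, if failures were spread at random over all runs.
    """
    total = base_total + sample_total
    fails = base_fail + sample_fail
    upper = min(fails, sample_total)
    favourable = sum(
        comb(sample_total, k) * comb(base_total, fails - k)
        for k in range(sample_fail, upper + 1)
    )
    return favourable / comb(total, fails)


# Numbers from the report: 0 failures in 139 runs before 4.15 GA,
# 3 failures in 31 runs in the last week.
p = fisher_one_sided(base_fail=0, base_total=139, sample_fail=3, sample_total=31)
print(f"one-sided p-value: {p:.4f}")
```

With these counts the p-value is well below 0.05, which is consistent with treating the failures as a potential regression rather than noise.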
Relevant slack thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1718200183009099
It seems that the kubelet's /metrics and /metrics/cadvisor endpoints fell over and later recovered.
is related to:
- MON-3913 Better insight required for TargetDown failure errors (To Do)
- OCPBUGS-36263 New kubelet metrics test should ignore outages during node update, not just reboot (MODIFIED)
- TRT-1718 Add Intervals for TargetDown metrics (Closed)
- TRT-1721 Add Intervals for TargetDown metrics in 4.16 (Closed)