-
Story
-
Resolution: Done
-
Major
Fallout of https://issues.redhat.com/browse/OCPBUGS-35371
We simply do not have enough visibility into why these kubelet endpoints are going down, outside of a reboot, while kubelet itself stays up.
A big step would be charting these outages as intervals. Add a new monitor test that queries Prometheus at the end of the run to find when these targets were down.
Prom query:
max by (node, metrics_path) (up{job="kubelet"}) == 0
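One way to turn that query into chartable intervals, as a rough sketch only: because of the `== 0` filter, a range query returns a sample only at steps where the target was down, so consecutive samples can be folded into a single interval. The function name `down_intervals` and the tolerance factor are assumptions for illustration, not part of any existing monitor test; the input shape follows the `data.result` matrix returned by the Prometheus `query_range` HTTP API.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]          # (start, end) as Unix timestamps
TargetKey = Tuple[str, str]             # (node, metrics_path)


def down_intervals(result: List[dict], step: float) -> Dict[TargetKey, List[Interval]]:
    """Fold query_range samples into per-target down intervals.

    `result` is the `data.result` list of a matrix response, i.e. entries of
    the form {"metric": {...}, "values": [[ts, "0"], ...]}.
    Hypothetical helper: samples no more than ~1.5 steps apart are treated
    as one continuous outage (the slack absorbs float rounding).
    """
    intervals: Dict[TargetKey, List[Interval]] = {}
    for series in result:
        labels = series["metric"]
        key = (labels.get("node", ""), labels.get("metrics_path", ""))
        spans: List[List[float]] = []
        for ts, _value in series["values"]:
            ts = float(ts)
            if spans and ts - spans[-1][1] <= step * 1.5:
                spans[-1][1] = ts        # extend the current outage
            else:
                spans.append([ts, ts])   # gap seen: start a new outage
        intervals.setdefault(key, []).extend((s, e) for s, e in spans)
    return intervals
```

A single isolated sample yields an interval whose start and end are equal; a caller may want to pad it by one step when charting.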
Then perhaps add a test that flakes if we see this happen outside of a node reboot. This seems to happen on every gcp-ovn (non-upgrade) job I look at. It does NOT seem to happen on AWS.
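The flake condition could be sketched roughly as follows. This assumes "outside of a node reboot" means the down interval does not overlap any reboot window for the same node; the name `outside_reboot` and the optional `grace` padding are hypothetical, and a real test would take both interval lists from the monitor's recorded intervals.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) as Unix timestamps


def outside_reboot(down: List[Interval],
                   reboots: List[Interval],
                   grace: float = 0.0) -> List[Interval]:
    """Return the down intervals that no reboot window explains.

    Hypothetical helper: a down interval counts as explained when it
    overlaps a reboot window widened by `grace` seconds on each side.
    """
    unexplained = []
    for start, end in down:
        overlaps = any(
            start <= r_end + grace and end >= r_start - grace
            for r_start, r_end in reboots
        )
        if not overlaps:
            unexplained.append((start, end))
    return unexplained
```

A test built on this would flake whenever the returned list is non-empty for any node.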
- is cloned by
-
TRT-1721 Add Intervals for TargetDown metrics in 4.16
- Closed
- relates to
-
OCPBUGS-35371 Kubelet metrics endpoints experiencing prolonged outages
- ASSIGNED