[TRT-1721] Add Intervals for TargetDown metrics in 4.16 - Red Hat Issue Tracker

Type: Story
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: None
Labels: None

Blocked: False
Blocked Reason: None
Ready: False

Target Version: 4.16.0

Fallout of https://issues.redhat.com/browse/OCPBUGS-35371

We simply do not have enough visibility into why these kubelet metrics endpoints go down outside of a reboot while kubelet itself stays up.

A big step would be charting these outages as intervals. Add a new monitor test that queries Prometheus at the end of the run to find when these targets were down.

Prom query:

max by (node, metrics_path) (up{job="kubelet"}) == 0
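
A rough sketch of how the results of that query could be turned into intervals, assuming it is run as a range query so that the `== 0` filter leaves only the steps at which a target was down. The `DownInterval` type, the `downIntervals` function, and the sample data are hypothetical and are not taken from openshift/origin; timestamps are Unix seconds.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// DownInterval is a hypothetical record of one outage of a kubelet metrics
// endpoint; it is not the interval type used in openshift/origin.
type DownInterval struct {
	Node        string
	MetricsPath string
	From, To    int64 // Unix seconds, inclusive
}

// downIntervals merges the timestamps at which one (node, metrics_path)
// series was returned by the query into contiguous intervals. Two samples
// belong to the same interval when they are at most one step apart.
func downIntervals(node, path string, downAt []int64, stepSeconds int64) []DownInterval {
	ts := append([]int64(nil), downAt...) // copy so the caller's slice is not reordered
	sort.Slice(ts, func(i, j int) bool { return ts[i] < ts[j] })

	var out []DownInterval
	for _, t := range ts {
		if n := len(out); n > 0 && t-out[n-1].To <= stepSeconds {
			out[n-1].To = t // still within one step: extend the current outage
			continue
		}
		out = append(out, DownInterval{Node: node, MetricsPath: path, From: t, To: t})
	}
	return out
}

func main() {
	// Made-up data: down for three consecutive 30s steps, then once more later.
	start := time.Date(2024, 6, 21, 12, 0, 0, 0, time.UTC).Unix()
	downAt := []int64{start, start + 30, start + 60, start + 300}
	for _, iv := range downIntervals("node-a", "/metrics/cadvisor", downAt, 30) {
		fmt.Printf("%s %s down from %s to %s\n", iv.Node, iv.MetricsPath,
			time.Unix(iv.From, 0).UTC().Format(time.RFC3339),
			time.Unix(iv.To, 0).UTC().Format(time.RFC3339))
	}
}
```

Each interval ends at the last sample seen down, so the real outage may extend up to one step beyond `To`.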

Then perhaps add a test that flakes if we see this happen outside of a node reboot. It seems to happen on every gcp-ovn (non-upgrade) job I look at. It does NOT seem to happen on AWS.
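
One possible shape for that flake check, as a sketch only. It assumes the down intervals and reboot intervals for a single node are already available as time ranges, and that any overlap with a reboot counts as explained. `Span`, `overlaps`, and `unexplainedDowns` are hypothetical names, and the numbers are made up.

```go
package main

import "fmt"

// Span is a hypothetical closed time range in Unix seconds for one node.
type Span struct{ From, To int64 }

// overlaps reports whether two closed ranges share at least one instant.
func overlaps(a, b Span) bool { return a.From <= b.To && b.From <= a.To }

// unexplainedDowns returns the down spans that do not overlap any reboot span.
func unexplainedDowns(downs, reboots []Span) []Span {
	var out []Span
	for _, d := range downs {
		explained := false
		for _, r := range reboots {
			if overlaps(d, r) {
				explained = true
				break
			}
		}
		if !explained {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	downs := []Span{{100, 160}, {400, 430}}
	reboots := []Span{{90, 200}}
	if bad := unexplainedDowns(downs, reboots); len(bad) > 0 {
		fmt.Printf("flake: %d kubelet target outage(s) outside a node reboot: %v\n", len(bad), bad)
	}
}
```

A stricter variant could require the reboot to fully contain the outage rather than merely overlap it.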

clones

TRT-1718 Add Intervals for TargetDown metrics (Closed)

relates to

OCPBUGS-35371 Kubelet metrics endpoints experiencing prolonged outages (ASSIGNED)

links to

openshift/origin#28891: TRT-1718: Add new intervals for kubelet metrics endpoints down

openshift/origin#28896: TRT-1718: Add new intervals for kubelet metrics endpoints down

openshift/origin#28901: TRT-1721: Add new intervals for kubelet metrics endpoints down

Assignee: Devan Goodwin

Reporter: Devan Goodwin

Votes: 0

Watchers: 1

Created: 2024/06/21 12:22 PM

Updated: 2024/06/24 1:25 PM

Resolved: 2024/06/24 1:25 PM
