[TRT-1718] Add Intervals for TargetDown metrics - Red Hat Issue Tracker

Type: Story
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: None
Labels: None

Blocked: False
Blocked Reason: None
Ready: False

Fallout of https://issues.redhat.com/browse/OCPBUGS-35371

We do not have enough visibility into why these kubelet metrics endpoints go down, outside of a node reboot, while the kubelet itself stays up.

A big step would be charting these outages as intervals. Add a new monitor test that queries Prometheus at the end of the run for the periods when these targets were down.

Prom query:

max by (node, metrics_path) (up{job="kubelet"}) == 0
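
A rough sketch of how the result of that query could be turned into intervals, assuming it is run as a range query against the Prometheus HTTP API (`/api/v1/query_range`). This is illustrative Python, not the code from the linked origin PRs; the function names, the base URL, and the step value are assumptions made for the example. Because the query ends in `== 0`, only samples where a target was down come back, so a gap between samples larger than the step is treated as the target having recovered.

```python
from urllib.parse import urlencode

# Query taken from this issue.
QUERY = 'max by (node, metrics_path) (up{job="kubelet"}) == 0'


def build_query_range_url(base_url, start, end, step):
    """Build a query_range URL for QUERY.

    base_url is a placeholder for wherever Prometheus is reachable;
    start/end are Unix timestamps and step is in seconds.
    """
    params = urlencode({"query": QUERY, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


def down_intervals(matrix_result, step):
    """Collapse matrix samples into (node, metrics_path, start, end) tuples.

    matrix_result is the "result" list of a matrix response. Consecutive
    samples no more than `step` seconds apart are merged into one interval.
    """
    intervals = []
    for series in matrix_result:
        labels = series.get("metric", {})
        node = labels.get("node", "")
        path = labels.get("metrics_path", "")
        start = prev = None
        for ts, _value in series.get("values", []):
            ts = float(ts)
            if start is None:
                start = prev = ts
            elif ts - prev > step:
                # Gap in the "down" samples: close the current interval.
                intervals.append((node, path, start, prev))
                start = prev = ts
            else:
                prev = ts
        if start is not None:
            intervals.append((node, path, start, prev))
    return intervals
```

An interval whose start and end are equal means the target was seen down in a single sample only.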

Then perhaps add a test that flakes if we see this happen outside of a node reboot. This seems to happen on every gcp-ovn (non-upgrade) job I look at. It does NOT seem to happen on AWS.
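
One possible shape for that flake check, sketched in Python under stated assumptions: down intervals are `(node, metrics_path, start, end)` tuples and reboot windows are known per node as `(start, end)` pairs. Both data shapes and the function names are hypothetical, chosen only to illustrate the overlap logic; how reboot windows are actually obtained is not covered here.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed ranges [a_start, a_end] and [b_start, b_end] intersect."""
    return a_start <= b_end and b_start <= a_end


def down_outside_reboot(down, reboots):
    """Return the down intervals that do not overlap any reboot of the same node.

    down:    list of (node, metrics_path, start, end)
    reboots: dict mapping node name to a list of (start, end) reboot windows
    """
    unexplained = []
    for node, path, start, end in down:
        windows = reboots.get(node, [])
        if not any(overlaps(start, end, r_start, r_end) for r_start, r_end in windows):
            unexplained.append((node, path, start, end))
    return unexplained
```

A test built on this would flake when the returned list is non-empty.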

is cloned by

TRT-1721 Add Intervals for TargetDown metrics in 4.16

Closed

relates to

OCPBUGS-35371 Kubelet metrics endpoints experiencing prolonged outages

ASSIGNED

links to

openshift/origin#28891: TRT-1718: Add new intervals for kubelet metrics endpoints down

openshift/origin#28896: TRT-1718: Add new intervals for kubelet metrics endpoints down

Assignee: Devan Goodwin
Reporter: Devan Goodwin
Votes: 0
Watchers: 2
Created: 2024/06/19 11:20 AM
Updated: 2024/06/24 1:07 PM
Resolved: 2024/06/24 1:07 PM
