OpenShift Monitoring / MON-3913

Better insight required for TargetDown failure errors


Type: Story
Priority: Major
Resolution: Unresolved
Status: NEW

In OCPBUGS-35371 we're having a hard time pinning down exactly why monitoring is flagging TargetDown for the kubelet's /metrics and /metrics/cadvisor endpoints on all nodes in periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn. The problem doesn't always escalate to a firing TargetDown (there seems to be roughly a 5 minute delay before TargetDown kicks in), but running this PromQL query against a job run such as this one:

    max by (node, metrics_path) (up{job="kubelet"}) == 0

shows, for every run I've looked at, all worker nodes having a lengthy outage to these endpoints; even when it doesn't last long enough to fire TargetDown, it's still several minutes. (Use Debug Tools > PromeCIeus.)
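For reference, a minimal sketch of evaluating that expression as a range query against the run's Prometheus API; the base URL and the two-hour window below are placeholders, not values from the ticket:

    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
        "net/url"
        "strconv"
        "time"
    )

    func main() {
        // Placeholder: the Prometheus endpoint PromeCIeus exposes for the job run.
        base := "http://localhost:9090"

        // Same expression as above: any kubelet target whose scrape is failing.
        query := `max by (node, metrics_path) (up{job="kubelet"}) == 0`
        end := time.Now()
        start := end.Add(-2 * time.Hour) // placeholder window covering the run

        params := url.Values{}
        params.Set("query", query)
        params.Set("start", strconv.FormatInt(start.Unix(), 10))
        params.Set("end", strconv.FormatInt(end.Unix(), 10))
        params.Set("step", "30") // 30s resolution, to see how long up==0 lasted

        resp, err := http.Get(base + "/api/v1/query_range?" + params.Encode())
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var body struct {
            Data struct {
                Result []struct {
                    Metric map[string]string `json:"metric"`
                    Values [][]interface{}   `json:"values"`
                } `json:"result"`
            } `json:"data"`
        }
        if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
            panic(err)
        }

        // One series per (node, metrics_path) pair; the sample count shows roughly
        // how long that endpoint stayed down at the chosen step.
        for _, r := range body.Data.Result {
            fmt.Printf("%s %s: %d samples with up==0\n",
                r.Metric["node"], r.Metric["metrics_path"], len(r.Values))
        }
    }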

Would it be possible to get visibility into what errors are coming back while scraping? Trevor points out one idea here; logging would be another option.
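On the visibility question, one option that needs nothing beyond stock Prometheus: the /api/v1/targets endpoint already reports the last scrape error per target. A rough sketch (the base URL is a placeholder):

    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    func main() {
        // Placeholder Prometheus base URL; in a cluster this would be the
        // monitoring stack's Prometheus service or route.
        resp, err := http.Get("http://localhost:9090/api/v1/targets?state=active")
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var body struct {
            Data struct {
                ActiveTargets []struct {
                    Labels     map[string]string `json:"labels"`
                    ScrapeURL  string            `json:"scrapeUrl"`
                    Health     string            `json:"health"`
                    LastError  string            `json:"lastError"`
                    LastScrape string            `json:"lastScrape"`
                } `json:"activeTargets"`
            } `json:"data"`
        }
        if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
            panic(err)
        }

        // Print the most recent scrape error for any kubelet target that isn't healthy.
        for _, t := range body.Data.ActiveTargets {
            if t.Labels["job"] == "kubelet" && t.Health != "up" {
                fmt.Printf("%s (%s) last scraped %s: %s\n",
                    t.ScrapeURL, t.Health, t.LastScrape, t.LastError)
            }
        }
    }

The limitation is that lastError only reflects the most recent scrape attempt, so something would have to poll it while the outage is happening; that's why capturing the errors more durably, e.g. via logging, would be better suited to CI runs.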

A complicating factor: we need this in 4.16, and we cannot reproduce the issue in 4.17 yet.

Attachments:
  1. image-2024-06-20-14-11-29-254.png (55 kB, Simon Pasquier)
  2. image-2024-06-20-14-12-08-313.png (107 kB, Simon Pasquier)

Assignee: Unassigned
Reporter: Devan Goodwin (rhn-engineering-dgoodwin)