OpenShift Monitoring / MON-3913

Better insight required for TargetDown failure errors


    • Type: Story
    • Resolution: Unresolved
    • Priority: Major
    • Status: NEW

      In OCPBUGS-35371 we're having a hard time pinning down exactly why monitoring is flagging TargetDown for the kubelet's /metrics and /metrics/cadvisor endpoints on all nodes in periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn. The problem doesn't always escalate to a firing TargetDown (there appears to be roughly a 5 minute delay before TargetDown kicks in), but running the following prom query on a job run such as this one:

      max by (node, metrics_path) (up{job="kubelet"}) == 0
      

      For every run I've looked at, it shows all worker nodes having a lengthy outage to these endpoints; even when it doesn't last long enough to fire TargetDown, it's still several minutes. (Use Debug Tools > PromeCIeus to run the query against a given job run.)
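
      For reference, here is a minimal sketch of pulling those outage windows out of a run programmatically, e.g. against a PromeCIeus-restored Prometheus. The PROM_URL, time window, and 15s step below are placeholders, not values from this report:

      import requests  # assumes the requests package is available

      # Placeholder endpoint, e.g. a PromeCIeus instance restored from the job run.
      PROM_URL = "http://localhost:9090"

      resp = requests.get(
          f"{PROM_URL}/api/v1/query_range",
          params={
              "query": 'max by (node, metrics_path) (up{job="kubelet"}) == 0',
              "start": "2024-06-11T00:00:00Z",  # placeholder: the job run's time window
              "end": "2024-06-11T02:00:00Z",
              "step": "15s",
          },
      )
      resp.raise_for_status()

      # The == 0 filter means each series only has samples while the target was down,
      # so the first and last timestamps roughly bound the outage (assuming one
      # contiguous gap per node/endpoint).
      for series in resp.json()["data"]["result"]:
          labels = series["metric"]
          samples = series["values"]
          start_ts, end_ts = float(samples[0][0]), float(samples[-1][0])
          print(f'{labels.get("node")} {labels.get("metrics_path")}: '
                f"down ~{(end_ts - start_ts) / 60:.1f} min ({len(samples)} samples)")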

      Would it be possible to get visibility into what errors are coming back while scraping? Trevor points out one idea here; logging the scrape errors would be another option.
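
      One low-effort way to see the error text, assuming access to a live (or PromeCIeus-restored) Prometheus, is the targets API, which already records the last scrape error per target. A sketch, with PROM_URL again a placeholder:

      import requests

      PROM_URL = "http://localhost:9090"  # placeholder; point at the cluster's Prometheus

      resp = requests.get(f"{PROM_URL}/api/v1/targets", params={"state": "active"})
      resp.raise_for_status()

      # /api/v1/targets reports health, lastScrape and lastError for every active
      # target, which is the error detail we're missing for the kubelet endpoints.
      for target in resp.json()["data"]["activeTargets"]:
          labels = target["labels"]
          if labels.get("job") != "kubelet":
              continue
          if target["health"] != "up":
              print(labels.get("node"), labels.get("metrics_path"),
                    target["lastScrape"], target["lastError"])

      Note this only surfaces the most recent error per target, so it helps live debugging more than after-the-fact CI analysis; persisting or logging those errors, as suggested above, would be needed to cover the CI case.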

      A complicating factor: we need this in 4.16, and we cannot reproduce the issue in 4.17 yet.

            Assignee: Unassigned
            Reporter: rhn-engineering-dgoodwin (Devan Goodwin)
            Votes: 0
            Watchers: 3
