Type: Bug
Resolution: Unresolved
Priority: Normal
Affects Version: 4.17.0
Severity: Low
The new test, [sig-node] kubelet metrics endpoints should always be reachable, is picking up some upgrade job runs where the metrics endpoint goes down for about 30 seconds during the generic node update phase and recovers before we reboot the node. As initially written, the test treats this as a flake because the outage has no overlap with a reboot.
Example: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade/1806142925785010176
Interval chart showing the problem: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1806142925785010176/periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade/intervals?filterText=master-1&intervalFile=e2e-timelines_spyglass_20240627-024633.json&overrideDisplayFlag=0&selectedSources=E2EFailed&selectedSources=MetricsEndpointDown&selectedSources=NodeState
The master outage at 3:30:59 causes a flake I'd rather avoid, because it doesn't extend into the reboot. I'd like to tighten this up so the test ignores any outage that overlaps with the node update phase, not just the reboot. This will be backported to 4.16 to tighten the signal there as well.
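For illustration, here is a minimal sketch in Go of the tightened overlap check; all names and timestamps here are hypothetical stand-ins, not the actual test code. The idea is that an outage interval should be excused when it intersects the node's whole update window, not only the narrower reboot window:

```go
package main

import (
	"fmt"
	"time"
)

// interval is a hypothetical stand-in for a monitor interval with a start and end time.
type interval struct {
	from, to time.Time
}

// overlaps reports whether two half-open intervals [a.from, a.to) and [b.from, b.to) intersect.
func overlaps(a, b interval) bool {
	return a.from.Before(b.to) && b.from.Before(a.to)
}

func main() {
	base := time.Date(2024, 6, 27, 3, 30, 0, 0, time.UTC)
	// Outage observed at 3:30:59, lasting ~30s: during the update phase, before the reboot.
	outage := interval{base.Add(59 * time.Second), base.Add(89 * time.Second)}
	update := interval{base.Add(-10 * time.Minute), base.Add(20 * time.Minute)} // whole node update phase
	reboot := interval{base.Add(5 * time.Minute), base.Add(8 * time.Minute)}    // reboot window only

	// Reboot-only rule: this outage doesn't overlap the reboot, so it flakes the test.
	fmt.Println("excused under reboot-only rule:", overlaps(outage, reboot)) // false -> flake
	// Tightened rule: any overlap with the update window excuses the outage.
	fmt.Println("excused under update rule:", overlaps(outage, update)) // true -> no flake
}
```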
- blocks: OCPBUGS-36744 New kubelet metrics test should ignore outages during node update, not just reboot (MODIFIED)
- is cloned by: OCPBUGS-36744 New kubelet metrics test should ignore outages during node update, not just reboot (MODIFIED)
- relates to: OCPBUGS-35371 Kubelet metrics endpoints experiencing prolonged outages (ASSIGNED)
- links to