OCPBUGS-36744 (OpenShift Bugs)

New kubelet metrics test should ignore outages during node update, not just reboot

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • 4.16.z
    • 4.17.0
    • Test Framework
    • None
    • Low
    • No
    • False

      This is a clone of issue OCPBUGS-36263. The following is the description of the original issue:

      The new test, [sig-node] kubelet metrics endpoints should always be reachable, is picking up some upgrade job runs where the metrics endpoint goes down for about 30 seconds during the generic node update phase and recovers before the node reboots. As initially written, the test only ignores outages that overlap with the reboot, so these runs are flaked.

      Example: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade/1806142925785010176
      Interval chart showing the problem: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1806142925785010176/periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade/intervals?filterText=master-1&intervalFile=e2e-timelines_spyglass_20240627-024633.json&overrideDisplayFlag=0&selectedSources=E2EFailed&selectedSources=MetricsEndpointDown&selectedSources=NodeState

      The master outage at 3:30:59 is causing a flake because it doesn't extend into the reboot. I'd rather it didn't flake.

      I'd like to tighten this up so that any outage overlapping the node update, not just the reboot, is ignored.
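
      A rough sketch of the overlap check being asked for, in Python for illustration only. This is not the actual test code; the Interval type, the function names, and the timestamps are all made up for the example. The idea is that an outage is only reported if it overlaps no update interval on the same node.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Interval:
    """A closed time span on one node (hypothetical type for this sketch)."""
    node: str
    start: datetime
    end: datetime


def overlaps(a: Interval, b: Interval) -> bool:
    """True when both intervals are on the same node and share any instant."""
    return a.node == b.node and a.start <= b.end and b.start <= a.end


def unexplained_outages(outages, updates):
    """Return the outages that do not overlap any node update interval."""
    return [o for o in outages if not any(overlaps(o, u) for u in updates)]


# Made-up timestamps, loosely modeled on the run described above.
t0 = datetime(2024, 6, 27, 3, 30, 0)
update = Interval("master-1", t0, t0 + timedelta(minutes=5))
during_update = Interval(
    "master-1", t0 + timedelta(seconds=59), t0 + timedelta(seconds=89)
)
unrelated = Interval(
    "master-1", t0 + timedelta(minutes=20), t0 + timedelta(minutes=21)
)

# Only the outage outside the update window should still be reported.
flagged = unexplained_outages([during_update, unrelated], [update])
```

      Here the 30 second outage inside the update window is dropped and only the later, unrelated outage stays in flagged. Using the whole update interval rather than just the reboot is the assumption that matches the request above.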

      Will be backported to 4.16 to tighten the signal there as well.

              rhn-engineering-dgoodwin Devan Goodwin
              openshift-crt-jira-prow OpenShift Prow Bot