OpenShift Bugs / OCPBUGS-26239

pathological events test failed multiple times for ns/openshift-kube-scheduler

Details

    • Critical
    • No
    • Proposed
    • False
    • Release Note Not Required
    • In Progress

    Description

      This is a clone of issue OCPBUGS-24537. The following is the description of the original issue:

      Description of problem:

          4.15 nightly payloads have been affected by failures of this test multiple times:
      
      : [sig-arch] events should not repeat pathologically for ns/openshift-kube-scheduler 0s
      { 1 events happened too frequently
      
      event happened 21 times, something is wrong: namespace/openshift-kube-scheduler node/ci-op-2gywzc86-aa265-5skmk-master-1 pod/openshift-kube-scheduler-guard-ci-op-2gywzc86-aa265-5skmk-master-1 hmsg/2652c73da5 - reason/ProbeError Readiness probe error: Get "https://10.0.0.7:10259/healthz": dial tcp 10.0.0.7:10259: connect: connection refused result=reject
      body:
       From: 08:41:08Z To: 08:41:09Z}
      
      In each aggregated run of 10 jobs, 2 to 3 jobs failed this test. Historically the test passed 100% of the time, but with the past two days of test data the pass rate has dropped to 97%, and the aggregator started allowing the failure in the latest payload: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-azure-ovn-upgrade-4.15-micro-release-openshift-release-analysis-aggregator/1732295947339173888
      
      The first payload in which this appeared is https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.nightly/release/4.15.0-0.nightly-2023-12-05-071627.
      
      All of the events occurred while cluster-operator/kube-scheduler was progressing.
      
      For comparison, here is a passing job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1731936539870498816
      
      Here is a failing one: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1731936538192777216
      
      Both jobs have the same set of probe error events. In the passing job, each event repeated fewer than 20 times, while in the failing job one of those events repeated more than 20 times, which caused the test failure.
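
The threshold behaviour described above can be illustrated with a minimal sketch. This is not the actual test implementation: the function name, the string event key, and the sample counts are hypothetical, and the cutoff of 20 is an assumption taken from the "more than 20 times" wording and the "event happened 21 times" message in this report.

```python
from collections import Counter

# Assumed cutoff, inferred from this report: an event seen more than
# 20 times is treated as pathological.
REPEAT_THRESHOLD = 20


def pathological_events(event_keys, threshold=REPEAT_THRESHOLD):
    """Return {event_key: count} for events seen more than `threshold` times."""
    counts = Counter(event_keys)
    return {key: n for key, n in counts.items() if n > threshold}


# Hypothetical, shortened key standing in for the ProbeError event above.
probe_error = "ns/openshift-kube-scheduler reason/ProbeError connection refused"

passing_job = [probe_error] * 19  # same event, below the cutoff
failing_job = [probe_error] * 21  # same event, above the cutoff

print(pathological_events(passing_job))
print(pathological_events(failing_job))
```

Under these assumptions the same probe error is ignored at 19 repeats and flagged at 21, which matches the report's observation that passing and failing jobs differ only in how often the event repeats.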


            People

              jchaloup@redhat.com Jan Chaloupka
              openshift-crt-jira-prow OpenShift Prow Bot
              Rama Kasturi Narra