OpenShift Bugs / OCPBUGS-62703

Monitoring operator sometimes spams kube events during upgrade

    • Quality / Stability / Reliability
    • Moderate

      (Feel free to update this bug's summary to be more specific.)
      Component Readiness has found a potential regression in the following test:

      [sig-arch] events should not repeat pathologically for ns/openshift-monitoring

      Significant regression detected.
      Fisher's exact test probability of a regression: 99.99%.
      Test pass rate dropped from 100.00% to 92.31%.

      Sample (being evaluated) Release: 4.20
      Start Time: 2025-09-26T00:00:00Z
      End Time: 2025-10-03T08:00:00Z
      Success Rate: 92.31%
      Successes: 36
      Failures: 3
      Flakes: 0
      Base (historical) Release: 4.18
      Start Time: 2025-01-26T00:00:00Z
      End Time: 2025-02-25T00:00:00Z
      Success Rate: 100.00%
      Successes: 145
      Failures: 0
      Flakes: 0
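
      As a rough cross-check of the figures above, the sketch below recomputes the pass rates from the raw counts and runs a one-sided Fisher's exact test on them. The function names are hypothetical, and the exact method and inputs Component Readiness uses are not shown in this report, so the probability it yields need not match the 99.99% quoted above.

```python
from math import comb


def pass_rate(successes, failures):
    """Pass rate in percent (flakes are 0 in both windows, so they are ignored here)."""
    return 100.0 * successes / (successes + failures)


def fisher_right_tail(sample_fail, sample_pass, base_fail, base_pass):
    """One-sided Fisher exact p-value.

    Probability of seeing at least `sample_fail` failures in the sample
    if sample and base runs shared a single underlying failure rate.
    """
    n_sample = sample_fail + sample_pass
    n_total = n_sample + base_fail + base_pass
    k_fail = sample_fail + base_fail
    denom = comb(n_total, n_sample)
    numer = 0
    # Sum hypergeometric terms for every outcome at least as extreme as observed.
    for x in range(sample_fail, min(k_fail, n_sample) + 1):
        numer += comb(k_fail, x) * comb(n_total - k_fail, n_sample - x)
    return numer / denom


if __name__ == "__main__":
    print(f"4.20 sample pass rate: {pass_rate(36, 3):.2f}%")
    print(f"4.18 base pass rate:   {pass_rate(145, 0):.2f}%")
    p = fisher_right_tail(3, 36, 0, 145)
    print(f"one-sided p-value: {p:.6f}")
```

      With only 3 failures in 39 runs, a single additional pass or failure shifts the result noticeably, which is consistent with the note that the failure only just reached the minimum count.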

      View the test details report for additional context.

      The failure also occurs in other configurations, but it is rare overall, so it has gone largely unnoticed until now. It surfaced today in this specific metal report because it happened to reach the minimum of 3 failures.

      Error message is:

      [sig-arch] events should not repeat pathologically for ns/openshift-monitoring 0s
      {  1 events happened too frequently
      
      event happened 25 times, something is wrong: namespace/openshift-monitoring node/worker-0 pod/prometheus-k8s-0 hmsg/357171899f - reason/Unhealthy Readiness probe errored: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1 (12:24:14Z) result=reject }
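
      When triaging many of these failures, it can help to pull the repeat count and the locator fields out of the message. The sketch below is a minimal parser for the line format shown above. The regular expression and the `parse_event` helper are assumptions based on this one sample, not part of the test suite.

```python
import re

# Pattern inferred from the single sample line in this report (an assumption).
EVENT_RE = re.compile(
    r"event happened (?P<count>\d+) times, something is wrong: "
    r"(?P<locator>.*?) - reason/(?P<reason>\S+) (?P<message>.*)"
)


def parse_event(line):
    """Return the repeat count, locator fields, reason, and message, or None."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    # The locator is a space-separated list of key/value pairs such as pod/prometheus-k8s-0.
    locator = dict(part.split("/", 1) for part in m["locator"].split())
    return {
        "count": int(m["count"]),
        "locator": locator,
        "reason": m["reason"],
        "message": m["message"],
    }


if __name__ == "__main__":
    sample = (
        "event happened 25 times, something is wrong: "
        "namespace/openshift-monitoring node/worker-0 pod/prometheus-k8s-0 "
        "hmsg/357171899f - reason/Unhealthy Readiness probe errored: rpc error: "
        "code = Unknown desc = command error: cannot register an exec PID: "
        "container is stopping, stdout: , stderr: , exit code -1 (12:24:14Z) result=reject"
    )
    event = parse_event(sample)
    print(event["count"], event["locator"]["pod"], event["reason"])
```

      Grouping parsed events by `pod` and `reason` across jobs would show whether the repeats are confined to `prometheus-k8s-0` during shutdown or are spread more widely.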
      

      It appears to happen just after the monitoring operator upgrades; see this chart.

      Note that this test is intended to protect the API server.

      Global test analysis can be used to find failures across all jobs, and search ci can show these specific failures over the past two days. The failure is quite common globally.

      Filed by: dgoodwin@redhat.com

              prasriva@redhat.com Pranshu Srivastava
              openshift-trt OpenShift Technical Release Team