OCPBUGS-55432: Metal Upgrades Regressed in Component Readiness

      (Feel free to update this bug's summary to be more specific.)
      We have a complex situation where a small number (3-4) of metal IPv6 job runs have failed during upgrade. This has caused a number of tests to regress in Component Readiness, close to 15 regressions at a time when we desperately need stability.

      The following tests are related:

      [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
      [sig-network] can collect network.openshift.io/disruption-actor=poller,network.openshift.io/disruption-target=host-to-pod poller pod logs
      [sig-network] can collect network.openshift.io/disruption-actor=poller,network.openshift.io/disruption-target=pod-to-service poller pod logs
      [sig-network] can collect network.openshift.io/disruption-actor=poller,network.openshift.io/disruption-target=host-to-service poller pod logs
      [sig-network] can collect network.openshift.io/disruption-actor=poller,network.openshift.io/disruption-target=pod-to-pod poller pod logs
      [sig-network-edge] Verify DNS availability during and after upgrade success
      [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility setup
      [sig-apps] replicaset-upgrade
      [sig-apps] job-upgrade
      [sig-apps] daemonset-upgrade
      [sig-apps] deployment-upgrade
      [sig-storage] [sig-api-machinery] configmap-upgrade
      [sig-storage] [sig-api-machinery] secret-upgrade

      Significant regression detected.
      Fisher's Exact probability of a regression: 99.99%.
      Test pass rate dropped from 100.00% to 93.88%.

      Sample (being evaluated) Release: 4.19
      Start Time: 2025-04-21T00:00:00Z
      End Time: 2025-04-28T12:00:00Z
      Success Rate: 93.88%
      Successes: 46
      Failures: 3
      Flakes: 0

      Base (historical) Release: 4.18
      Start Time: 2025-01-26T00:00:00Z
      End Time: 2025-02-25T23:59:59Z
      Success Rate: 100.00%
      Successes: 185
      Failures: 0
      Flakes: 0

      View the test details report for additional context.
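
      For context on how the counts above turn into the reported probability, here is a minimal sketch of the comparison using SciPy's Fisher's exact test. This is not the Component Readiness implementation and may not reproduce the exact 99.99% figure (Sippy applies its own adjustments), but it shows how the two pass rates are compared.

          # Sketch only: approximates the Component Readiness comparison with a
          # one-sided Fisher's exact test over the counts quoted above.
          from scipy.stats import fisher_exact

          # Rows: [failures, successes]; row 0 = 4.19 sample, row 1 = 4.18 basis.
          table = [[3, 46],    # sample: 3 failures, 46 successes
                   [0, 185]]   # basis:  0 failures, 185 successes

          # One-sided: are failures over-represented in the sample vs. the basis?
          _, p_value = fisher_exact(table, alternative="greater")

          # (1 - p) is roughly the "probability of a regression" style number;
          # the exact value Sippy reports may differ.
          print(f"probability of regression ~ {(1 - p_value) * 100:.2f}%")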

      The job runs I see causing this:

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-upgrade-ovn-ipv6/1916341644907515904
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-upgrade-ovn-ipv6/1915775815010750464
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-upgrade-runc/1916679378356408320
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-upgrade-runc/1915592257860276224

      These may be failing due to mirroring issues:

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-upgrade-runc/1915592257860276224

      shows:

      time="2025-04-27T05:55:36Z" level=info msg="event interval matches E2EImagePullBackOff" locator="

      {Kind map[hmsg:0aee5d1c29 namespace:e2e-k8s-sig-apps-job-upgrade-8966 node:worker-0.ostest.test.metalkube.org pod:foo-wb247]}

      " message="

      {BackOff Back-off pulling image \"virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-36-1-1-n3BezCOfxp98l84K\" map[firstTimestamp:2025-04-27T05:55:36Z lastTimestamp:2025-04-27T05:55:36Z reason:BackOff]}

      "

      Most of these tests involve pulling images, so I'm suspicious the whole batch could be caused by mirroring problems, but the new mirroring test (added in https://issues.redhat.com/browse/OCPBUGS-48630) does not appear to be complaining.
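
      If the mirror registry is suspected, one quick check against a live affected cluster is to look for BackOff events that reference the mirror host seen in the event above. A minimal sketch with the Kubernetes Python client, assuming kubeconfig access to such a cluster:

          # Sketch only: list image-pull BackOff events that mention the local
          # mirror registry from the event interval above.
          from kubernetes import client, config

          config.load_kube_config()  # or config.load_incluster_config()
          v1 = client.CoreV1Api()

          events = v1.list_event_for_all_namespaces(field_selector="reason=BackOff")
          for ev in events.items:
              if "virthost.ostest.test.metalkube.org:5000" in (ev.message or ""):
                  print(ev.last_timestamp, ev.involved_object.namespace,
                        ev.involved_object.name, ev.message)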

      In any case, this is a mass of regressions in Component Readiness, which is a release blocker. For whatever reason, upgrades do not look sufficiently stable compared to past releases, and this must be fixed. These regressions may eventually roll off, but the failure pattern is consistent, so they could return at any time.
