OCPBUGS-55432: Metal Upgrades Regressed in Component Readiness

      (Feel free to update this bug's summary to be more specific.)
      We have a complex situation where a small number (3-4) of metal IPv6 job runs have failed during upgrade. This has caused a number of tests to regress in Component Readiness, close to 15 regressions at a time when we desperately need stability.

      The following tests are related:

      [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
      [sig-network] can collect network.openshift.io/disruption-actor=poller,network.openshift.io/disruption-target=host-to-pod poller pod logs
      [sig-network] can collect network.openshift.io/disruption-actor=poller,network.openshift.io/disruption-target=pod-to-service poller pod logs
      [sig-network] can collect network.openshift.io/disruption-actor=poller,network.openshift.io/disruption-target=host-to-service poller pod logs
      [sig-network] can collect network.openshift.io/disruption-actor=poller,network.openshift.io/disruption-target=pod-to-pod poller pod logs
      [sig-network-edge] Verify DNS availability during and after upgrade success
      [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility setup
      [sig-apps] replicaset-upgrade
      [sig-apps] job-upgrade
      [sig-apps] daemonset-upgrade
      [sig-apps] deployment-upgrade
      [sig-storage] [sig-api-machinery] configmap-upgrade
      [sig-storage] [sig-api-machinery] secret-upgrade

      Significant regression detected.
      Fisher's Exact probability of a regression: 99.99%.
      Test pass rate dropped from 100.00% to 93.88%.

      Sample (being evaluated) Release: 4.19
      Start Time: 2025-04-21T00:00:00Z
      End Time: 2025-04-28T12:00:00Z
      Success Rate: 93.88%
      Successes: 46
      Failures: 3
      Flakes: 0

      Base (historical) Release: 4.18
      Start Time: 2025-01-26T00:00:00Z
      End Time: 2025-02-25T23:59:59Z
      Success Rate: 100.00%
      Successes: 185
      Failures: 0
      Flakes: 0

      View the test details report for additional context.
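
      For context on how the counts above turn into the reported probability, here is a minimal sketch of the comparison using SciPy's Fisher's exact test. This is not the Component Readiness implementation and may not reproduce the exact 99.99% figure (Sippy applies its own adjustments), but it shows how the two pass rates are compared.

          # Sketch only: approximates the Component Readiness comparison with a
          # one-sided Fisher's exact test over the counts quoted above.
          from scipy.stats import fisher_exact

          # Rows: [failures, successes]; row 0 = 4.19 sample, row 1 = 4.18 basis.
          table = [[3, 46],    # sample: 3 failures, 46 successes
                   [0, 185]]   # basis:  0 failures, 185 successes

          # One-sided: are failures over-represented in the sample vs. the basis?
          _, p_value = fisher_exact(table, alternative="greater")

          # (1 - p) is roughly the "probability of a regression" style number;
          # the exact value Sippy reports may differ.
          print(f"probability of regression ~ {(1 - p_value) * 100:.2f}%")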

      The job runs I see causing this:

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-upgrade-ovn-ipv6/1916341644907515904
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-upgrade-ovn-ipv6/1915775815010750464
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-upgrade-runc/1916679378356408320
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-upgrade-runc/1915592257860276224

      These may be failing due to mirroring issues:

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-metal-ipi-ovn-upgrade-runc/1915592257860276224

      shows:

      time="2025-04-27T05:55:36Z" level=info msg="event interval matches E2EImagePullBackOff" locator="

      {Kind map[hmsg:0aee5d1c29 namespace:e2e-k8s-sig-apps-job-upgrade-8966 node:worker-0.ostest.test.metalkube.org pod:foo-wb247]}

      " message="

      {BackOff Back-off pulling image \"virthost.ostest.test.metalkube.org:5000/localimages/local-test-image:e2e-7-registry-k8s-io-e2e-test-images-busybox-1-36-1-1-n3BezCOfxp98l84K\" map[firstTimestamp:2025-04-27T05:55:36Z lastTimestamp:2025-04-27T05:55:36Z reason:BackOff]}

      "

      Most of these tests involve pulling images, so I'm suspicious the whole batch could be caused by mirroring problems, but the new mirroring test (added in https://issues.redhat.com/browse/OCPBUGS-48630) does not appear to be complaining.
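
      If the mirror registry is suspected, one quick check against a live affected cluster is to look for BackOff events that reference the mirror host seen in the event above. A minimal sketch with the Kubernetes Python client, assuming kubeconfig access to such a cluster:

          # Sketch only: list image-pull BackOff events that mention the local
          # mirror registry from the event interval above.
          from kubernetes import client, config

          config.load_kube_config()  # or config.load_incluster_config()
          v1 = client.CoreV1Api()

          events = v1.list_event_for_all_namespaces(field_selector="reason=BackOff")
          for ev in events.items:
              if "virthost.ostest.test.metalkube.org:5000" in (ev.message or ""):
                  print(ev.last_timestamp, ev.involved_object.namespace,
                        ev.involved_object.name, ev.message)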

      In any case, this is a mass of regressions in Component Readiness, which is a release blocker. For whatever reason, upgrades do not look sufficiently stable compared to past releases, and this must be fixed. These regressions may eventually roll off, but the failure pattern is consistent, so they could return at any time.
