Type: Bug
Resolution: Not a Bug
Priority: Critical
Severity: Important
Affects Version: 4.19.0
Impact: Quality / Stability / Reliability
(Feel free to update this bug's summary to be more specific.)
Component Readiness has found a potential regression in the following test:
[sig-network][OCPFeatureGate:RouteExternalCertificate][Feature:Router][apigroup:route.openshift.io] with valid setup the router should support external certificate and the secret is updated then also routes are reachable [Suite:openshift/conformance/parallel]
Test has a 93.75% pass rate, but 95.00% is required.
Sample (being evaluated) Release: 4.19
Start Time: 2025-04-28T00:00:00Z
End Time: 2025-05-05T23:59:59Z
Success Rate: 93.75%
Successes: 45
Failures: 3
Flakes: 0
View the test details report for additional context.
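For context, the 93.75% figure follows directly from the run counts above: 45 successes out of 48 total runs. A minimal sketch of that arithmetic (assuming flakes are handled separately from failures; with zero flakes in this sample the assumption does not change the result):

```go
package main

import "fmt"

func main() {
	// Run counts from the Component Readiness report above.
	successes, failures, flakes := 45, 3, 0

	// With zero flakes in this sample, how flakes are weighted doesn't
	// matter; the pass rate is simply successes / (successes + failures).
	_ = flakes
	passRate := float64(successes) / float64(successes+failures) * 100

	const required = 95.00
	fmt.Printf("pass rate: %.2f%%, required: %.2f%%, regressed: %v\n",
		passRate, required, passRate < required)
}
```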
This is much less about that specific test, however; in the three failed job runs in that report you see the same pattern.
[sig-network][Feature:Whereabouts] should assign unique IP addresses to each pod in the event of a race condition case [apigroup:k8s.cni.cncf.io] [Suite:openshift/conformance/parallel]
After upgrade, well into conformance testing, we seem to lose the apiserver at least partially.
Using the last run in the list above, here are the intervals showing the issue.
You can see the massive bar of apiserver disruption starting at around 6:08:20, and then tests start failing. Note that the disruption we see is from the external host in the CI cluster where tests are running, trying to reach the apiservers in the cluster. Naturally tests will fail in this state. Interestingly, other tests seem to keep passing, so some requests are getting through.
I don't think this is an actual apiserver outage: we see no signs of problems in the cluster, including internal monitoring of the apiserver backends. We seem to lose every single external access point. We know the CI cluster itself can reach the network, since we poll some static endpoints and there's no sign of disruption to those in these job runs. To me, it looks like the metal cluster environment has an inbound network issue.
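To make that reasoning concrete, here is a minimal sketch of the kind of external disruption sampling described above: one poller targets the cluster-under-test's apiserver, another targets a static control endpoint, and each records intervals where requests fail. The names, URLs, and one-second interval are hypothetical and not the actual openshift/origin disruption monitors; the point is that when only the apiserver poller reports disruption while the control poller stays healthy, the problem is inbound to the cluster rather than with the CI cluster's own connectivity.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// pollTarget issues a GET against url once per second for the given duration
// and prints the intervals during which requests failed.
func pollTarget(name, url string, d time.Duration) {
	client := &http.Client{Timeout: 3 * time.Second}
	var downSince *time.Time

	deadline := time.Now().Add(d)
	for time.Now().Before(deadline) {
		resp, err := client.Get(url)
		now := time.Now()
		healthy := err == nil && resp.StatusCode < 500
		if resp != nil {
			resp.Body.Close()
		}

		switch {
		case !healthy && downSince == nil:
			start := now
			downSince = &start
		case healthy && downSince != nil:
			fmt.Printf("[%s] disrupted from %s to %s\n",
				name, downSince.Format(time.RFC3339), now.Format(time.RFC3339))
			downSince = nil
		}
		time.Sleep(time.Second)
	}
	if downSince != nil {
		fmt.Printf("[%s] still disrupted at end of sampling (since %s)\n",
			name, downSince.Format(time.RFC3339))
	}
}

func main() {
	// Hypothetical targets: the cluster-under-test's apiserver health
	// endpoint, and a static control endpoint used to confirm the CI
	// cluster's own outbound connectivity.
	targets := map[string]string{
		"kube-apiserver": "https://api.example-cluster.test:6443/readyz",
		"static-control": "https://example.com/",
	}

	var wg sync.WaitGroup
	for name, url := range targets {
		wg.Add(1)
		go func(name, url string) {
			defer wg.Done()
			pollTarget(name, url, time.Minute)
		}(name, url)
	}
	wg.Wait()
}
```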
The actual regressed tests here can jump around, given that this causes mass failures. It also appears on the regression board from this report:
[sig-network][OCPFeatureGate:NetworkSegmentation][Feature:UserDefinedPrimaryNetworks] when using openshift ovn-kubernetes created using NetworkAttachmentDefinitions isolates overlapping CIDRs with L3 primary UDN [Suite:openshift/conformance/parallel]