Type: Bug
Resolution: Not a Bug
Priority: Critical
Severity: Important
Affects Version: 4.19.0
Impact: Quality / Stability / Reliability
(Feel free to update this bug's summary to be more specific.)
Component Readiness has found a potential regression in the following test:
[sig-network][OCPFeatureGate:RouteExternalCertificate][Feature:Router][apigroup:route.openshift.io] with valid setup the router should support external certificate and the secret is updated then also routes are reachable [Suite:openshift/conformance/parallel]
Test has a 93.75% pass rate, but 95.00% is required.
Sample (being evaluated) Release: 4.19
Start Time: 2025-04-28T00:00:00Z
End Time: 2025-05-05T23:59:59Z
Success Rate: 93.75%
Successes: 45
Failures: 3
Flakes: 0
View the test details report for additional context.
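For context, the 93.75% figure follows directly from the run counts above: 45 successes out of 48 total runs. A minimal sketch of that arithmetic (assuming flakes are handled separately from failures; with zero flakes in this sample the assumption does not change the result):

```go
package main

import "fmt"

func main() {
	// Run counts from the Component Readiness report above.
	successes, failures, flakes := 45, 3, 0

	// With zero flakes in this sample, how flakes are weighted doesn't
	// matter; the pass rate is simply successes / (successes + failures).
	_ = flakes
	passRate := float64(successes) / float64(successes+failures) * 100

	const required = 95.00
	fmt.Printf("pass rate: %.2f%%, required: %.2f%%, regressed: %v\n",
		passRate, required, passRate < required)
}
```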
This is much less about that specific test, however; in the three failed job runs in that report you see the same pattern.
[sig-network][Feature:Whereabouts] should assign unique IP addresses to each pod in the event of a race condition case [apigroup:k8s.cni.cncf.io] [Suite:openshift/conformance/parallel]
After upgrade, well into conformance testing, we seem to lose the apiserver at least partially.
Using the last run in the list above, here are the intervals showing the issue.
You can see the massive bar of apiserver disruption starting at around 6:08:20, and then tests start failing. Note that the disruption we see is from the external host in the CI cluster where tests are running, trying to reach the apiservers in the cluster. Naturally tests will fail in this state. Interestingly, other tests seem to keep passing, so some requests are getting through.
I don't think this is an actual apiserver outage: we see no signs of problems in the cluster, including internal monitoring of the apiserver backends. We seem to lose every single external access point. We know the CI cluster itself can reach the network, since we poll some static endpoints and there's no sign of disruption to those in these job runs. To me, it looks like the metal cluster environment has an inbound network issue.
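To make that reasoning concrete, here is a minimal sketch of the kind of external disruption sampling described above: one poller targets the cluster-under-test's apiserver, another targets a static control endpoint, and each records intervals where requests fail. The names, URLs, and one-second interval are hypothetical and not the actual openshift/origin disruption monitors; the point is that when only the apiserver poller reports disruption while the control poller stays healthy, the problem is inbound to the cluster rather than with the CI cluster's own connectivity.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// pollTarget issues a GET against url once per second for the given duration
// and prints the intervals during which requests failed.
func pollTarget(name, url string, d time.Duration) {
	client := &http.Client{Timeout: 3 * time.Second}
	var downSince *time.Time

	deadline := time.Now().Add(d)
	for time.Now().Before(deadline) {
		resp, err := client.Get(url)
		now := time.Now()
		healthy := err == nil && resp.StatusCode < 500
		if resp != nil {
			resp.Body.Close()
		}

		switch {
		case !healthy && downSince == nil:
			start := now
			downSince = &start
		case healthy && downSince != nil:
			fmt.Printf("[%s] disrupted from %s to %s\n",
				name, downSince.Format(time.RFC3339), now.Format(time.RFC3339))
			downSince = nil
		}
		time.Sleep(time.Second)
	}
	if downSince != nil {
		fmt.Printf("[%s] still disrupted at end of sampling (since %s)\n",
			name, downSince.Format(time.RFC3339))
	}
}

func main() {
	// Hypothetical targets: the cluster-under-test's apiserver health
	// endpoint, and a static control endpoint used to confirm the CI
	// cluster's own outbound connectivity.
	targets := map[string]string{
		"kube-apiserver": "https://api.example-cluster.test:6443/readyz",
		"static-control": "https://example.com/",
	}

	var wg sync.WaitGroup
	for name, url := range targets {
		wg.Add(1)
		go func(name, url string) {
			defer wg.Done()
			pollTarget(name, url, time.Minute)
		}(name, url)
	}
	wg.Wait()
}
```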
The actual regressed tests here can jump around, given that this causes mass failures. It also appears on the regression board from this report:
[sig-network][OCPFeatureGate:NetworkSegmentation][Feature:UserDefinedPrimaryNetworks] when using openshift ovn-kubernetes created using NetworkAttachmentDefinitions isolates overlapping CIDRs with L3 primary UDN [Suite:openshift/conformance/parallel]