Bug
Resolution: Unresolved
4.21
(Feel free to update this bug's summary to be more specific.)
Component Readiness has found a potential regression in the following test:
[sig-auth][Feature:ProjectAPI] TestScopedProjectAccess should succeed [apigroup:user.openshift.io][apigroup:project.openshift.io][apigroup:authorization.openshift.io] [Suite:openshift/conformance/parallel]
Significant regression detected.
Fisher's Exact probability of a regression: 100.00% (a reproduction sketch follows this report).
Test pass rate dropped from 100.00% to 92.23%.
Sample (being evaluated) Release: 4.21
Start Time: 2025-11-27T00:00:00Z
End Time: 2025-12-04T16:00:00Z
Success Rate: 92.23%
Successes: 88
Failures: 8
Flakes: 7
Base (historical) Release: 4.17
Start Time: 2024-09-01T00:00:00Z
End Time: 2024-10-01T00:00:00Z
Success Rate: 100.00%
Successes: 997
Failures: 0
Flakes: 0
View the test details report for additional context.
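For context on the numbers above: the reported 92.23% sample pass rate matches (successes + flakes) / total = 95 / 103, and the "probability of a regression" figure is consistent with a one-sided Fisher's exact test on the failure counts. Below is a minimal sketch, not the actual Component Readiness code; treating flakes as passes and reporting 1 - p are assumptions made to match the reported numbers.

# Minimal sketch, not the actual Component Readiness implementation.
# Assumption: flakes count as passes (matching the reported 92.23% sample
# success rate: (88 + 7) / (88 + 7 + 8) = 95 / 103), and the "probability
# of a regression" is 1 - p from a one-sided Fisher's exact test.
from scipy.stats import fisher_exact

base_pass, base_fail = 997 + 0, 0      # 4.17 window: successes + flakes, failures
sample_pass, sample_fail = 88 + 7, 8   # 4.21 window: successes + flakes, failures

print(f"sample pass rate: {sample_pass / (sample_pass + sample_fail):.2%}")  # 92.23%

# 2x2 contingency table: rows = (sample, base), columns = (failures, passes)
table = [[sample_fail, sample_pass], [base_fail, base_pass]]
_, p = fisher_exact(table, alternative="greater")
print(f"probability of a regression: {(1 - p):.2%}")  # ~100.00%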
An additional regression has also been detected. The exact tests that fail will be somewhat random, but so far we've identified these regressions linked to this issue:
[sig-apps][Feature:DeploymentConfig] deploymentconfigs with revision history limits should never persist more old deployments than acceptable after being observed by the controller [apigroup:apps.openshift.io] [Suite:openshift/conformance/parallel]
[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks should have a working validating webhook
The problem appears to clearly begin on Nov 27, when we picked up a new pattern occurring in a decent number of micro upgrade jobs.
An example of what this looks like: you can see the apiserver jamming up, etcd problems, broad disruption both in-cluster to the affected master (which we associated with OVN being CPU starved) and external apiserver disruption (we're guessing because etcd is choking), and a big block of failed tests. As a result, we see lots of regressions once any one particular test accumulates the minimum number of required failures.
It has also caused a regression with:
As well as several others popping up daily, visible in the triage page linked in the comments below. The exact tests that fail vary a little, so we have to get lucky to even see one particular test trigger a regression.
However, after breaking out the GatewayControllerAPI tests, we found occurrences that hit immediately after those tests complete, which looks like this. This may indicate GatewayControllerAPI is not the cause of what's happening. Another batch of PRs was tried with the GatewayControllerAPI tests fully removed, and the problem again showed up in one of 20 runs. This feels like it made the problem less common, but it's hard to say with limited data.
The two jobs identified so far that seem to show it most commonly are:
- periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-upgrade-fips
- periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-upgrade
Interestingly, these are both micro upgrade jobs; we cannot yet find it in minor upgrades, which would be odd because the problem occurs well after the upgrade, near the end of conformance testing.
Analyzing both disruption charts and looking for a pattern in the job pass rate where we suddenly start seeing 20-40 failures, it becomes clear our problem started on Nov 27th. This appears to be the first affected job run: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-upgrade-fips/1994146713333403648 which comes from payload 4.21.0-0.nightly-2025-11-27-173427.
Unfortunately, due to a long string of red payloads on Nov 27, the changeset in that payload is huge (160 PRs), but we can compare it to what was in the prior payload with the command below (a scripted version of this comparison is sketched after the PR list):
curl "https://sippy.dptools.openshift.org/api/payloads/diff?toPayload=4.21.0-0.nightly-2025-11-27-173427&fromPayload=4.21.0-0.nightly-2025-11-26-140450" | jq | less
Two interesting PRs show up:
- Kube 1.34.2 rebase: https://github.com/openshift/kubernetes/pull/2514
- ovnkube rebase: https://github.com/openshift/ovn-kubernetes/pull/2864 (no sign of the problem in pre-merge testing)
It's unclear whether either is related, but both are suspicious, particularly the Kube rebase.
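If it helps to scan that changeset programmatically, here is a small sketch that pulls the same diff and extracts the GitHub PR links from the raw response. It assumes only that the response body contains PR URLs somewhere; it does not rely on the exact JSON schema of Sippy's payload-diff endpoint, which is not documented in this bug.

# Minimal sketch: list every PR referenced in the payload diff.
# Assumption: the response contains github.com pull request URLs as plain
# strings; we regex over the raw body rather than guessing Sippy's schema.
import re
import urllib.request

URL = (
    "https://sippy.dptools.openshift.org/api/payloads/diff"
    "?toPayload=4.21.0-0.nightly-2025-11-27-173427"
    "&fromPayload=4.21.0-0.nightly-2025-11-26-140450"
)

PR_RE = re.compile(r"https://github\.com/[\w.-]+/[\w.-]+/pull/\d+")

with urllib.request.urlopen(URL) as resp:
    body = resp.read().decode()

prs = sorted(set(PR_RE.findall(body)))
print(f"{len(prs)} PRs changed between payloads")
for pr in prs:
    print(pr)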
This is a clear release blocker until we have an explanation. The key things we need right now are:
- Analysis of what is in the kube rebase PR that might cause this kind of performance problem.
- Analysis of what exactly the kube apiserver is doing in runs such as this when the problem occurs (22:50:21 in this case is when the trouble starts). The Prow job is linked at the top of the intervals chart if you want artifacts.
Filed by: dgoodwin@redhat.com
- is duplicated by: OCPBUGS-66351 [OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] test failing too often on metal (Closed)
- links to