OCPBUGS-66420

Loss of APIServer networking, etcd quorum and high CPU causing mass test failures since Nov 27

      (Feel free to update this bug's summary to be more specific.)
      Component Readiness has found a potential regression in the following test:

      [sig-auth][Feature:ProjectAPI]  TestScopedProjectAccess should succeed [apigroup:user.openshift.io][apigroup:project.openshift.io][apigroup:authorization.openshift.io] [Suite:openshift/conformance/parallel]

      Significant regression detected.
      Fisher's exact probability of a regression: 100.00%.
      Test pass rate dropped from 100.00% to 92.23%.

      Sample (being evaluated) Release: 4.21
      Start Time: 2025-11-27T00:00:00Z
      End Time: 2025-12-04T16:00:00Z
      Success Rate: 92.23%
      Successes: 88
      Failures: 8
      Flakes: 7
      Base (historical) Release: 4.17
      Start Time: 2024-09-01T00:00:00Z
      End Time: 2024-10-01T00:00:00Z
      Success Rate: 100.00%
      Successes: 997
      Failures: 0
      Flakes: 0

      View the test details report for additional context.
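
      As a rough illustration of the statistic quoted above (this is not the Component Readiness implementation, and how it treats flakes and converts the p-value into a "probability of a regression" is an assumption), Fisher's exact test can be run directly on the pass/fail counts reported above:

      # Illustrative only: Fisher's exact test on a 2x2 table built from the
      # counts above. How Component Readiness counts flakes and converts the
      # p-value into a "probability of a regression" is assumed here.
      from scipy.stats import fisher_exact

      sample = {"successes": 88, "failures": 8}   # 4.21 sample window
      base = {"successes": 997, "failures": 0}    # 4.17 historical window

      table = [
          [sample["failures"], sample["successes"]],
          [base["failures"], base["successes"]],
      ]

      # One-sided test: are failures over-represented in the sample vs. the base?
      _, p_value = fisher_exact(table, alternative="greater")
      print(f"p-value: {p_value:.6f}")
      print(f"rough probability of a regression: {(1 - p_value) * 100:.2f}%")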

      The exact tests that fail will be somewhat random, but so far we've identified these additional regressions linked to this issue:

      [sig-apps][Feature:DeploymentConfig] deploymentconfigs with revision history limits should never persist more old deployments than acceptable after being observed by the controller [apigroup:apps.openshift.io] [Suite:openshift/conformance/parallel]

      [sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks should have a working validating webhook

      The problem appears to clearly begin on Nov 27, when we picked up a new pattern occurring in a decent number of micro upgrade jobs.

      An example of what this looks like: you can see the apiserver jamming up, etcd problems, broad disruption both in-cluster to the affected master (which we associate with OVN being CPU starved) and to the external apiserver (we're guessing because etcd is choking), and a big block of failed tests.

      As a result, we see this surface as lots of regressions, each appearing once one particular test accumulates the minimum number of required failures.

      This also caused a regression with:

      Several others are popping up daily as well, visible in the triage page linked in the comments below. The exact tests that fail vary a little, so we have to get lucky to even see one particular test trigger a regression.

      However, after breaking out the GatewayControllerAPI tests, we found occurrences that hit immediately after those tests complete, which looks like this. This may indicate GatewayControllerAPI is not the cause of what's happening. Another batch of PRs was tried with the GatewayControllerAPI tests fully removed, and the problem again showed up in one of 20 runs. This seems to have made the problem less common, but it's hard to say with limited data.

      The two jobs identified so far that seem to show it most commonly are:

      • periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-upgrade-fips
      • periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-upgrade

      Interestingly, these are both micro upgrade jobs; we cannot yet find it in minor upgrades, which would be odd because the problem occurs well after the upgrade, near the end of conformance testing.

      Analyzing both disruption charts and looking for a pattern in the job pass rate where we suddenly start seeing 20-40 failures, it becomes clear our problem started on Nov 27th. This appears to be the first job run: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-upgrade-fips/1994146713333403648, which comes from payload 4.21.0-0.nightly-2025-11-27-173427.
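
      For anyone repeating that scan, the following is a rough sketch (not existing tooling) that walks recent runs of the suspect job in the public test-platform-results bucket and prints each run's finish time and overall result, which is enough to spot where the mass failures begin. It assumes prow's standard finished.json layout and the unauthenticated GCS JSON listing API; counting the 20-40 individual test failures per run would additionally require parsing the junit artifacts.

      # Rough sketch, not existing tooling: list recent runs of the suspect job
      # from the public test-platform-results bucket and print when each run
      # finished and its overall result. Assumes prow's standard finished.json
      # layout and the unauthenticated GCS JSON listing API; the listing may be
      # paginated, so this only looks at the first page of results.
      import json
      import urllib.request
      from datetime import datetime, timezone

      BUCKET = "test-platform-results"
      JOB = "periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-upgrade-fips"
      LIST_URL = (
          f"https://storage.googleapis.com/storage/v1/b/{BUCKET}/o"
          f"?prefix=logs/{JOB}/&delimiter=/"
      )

      def get_json(url):
          with urllib.request.urlopen(url) as resp:
              return json.load(resp)

      listing = get_json(LIST_URL)
      # Run IDs are fixed-width numeric strings, so a lexical sort is chronological.
      run_ids = sorted(p.rstrip("/").rsplit("/", 1)[-1] for p in listing.get("prefixes", []))

      for run_id in run_ids[-20:]:  # roughly the most recent runs on this page
          url = f"https://storage.googleapis.com/{BUCKET}/logs/{JOB}/{run_id}/finished.json"
          try:
              finished = get_json(url)
          except Exception:
              continue  # run still in progress or artifact missing
          when = datetime.fromtimestamp(finished.get("timestamp", 0), tz=timezone.utc)
          print(run_id, when.isoformat(), finished.get("result"))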

      Unfortunately, due to a long string of red payloads on Nov 27, the changeset in that payload is huge (160 PRs), but we can compare against what was in the prior payload with:

      curl "https://sippy.dptools.openshift.org/api/payloads/diff?toPayload=4.21.0-0.nightly-2025-11-27-173427&fromPayload=4.21.0-0.nightly-2025-11-26-140450" |jq |less
      

      Two interesting PRs show up:

      Unclear if either is related, but both are suspicious, particularly the Kube rebase.

      This is a clear release blocker until we have an explanation. The key things we need right now are:

      • Analysis of what is in the kube rebase PR that might cause this kind of performance problem.
      • Analysis of what exactly the kube apiserver is doing in runs such as this when the problem occurs (22:50:21 in this case is when the trouble starts). The prow job is linked at the top of the intervals chart if you want artifacts; a rough starting point for scanning them is sketched below.
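
      For that second item, here is a hypothetical starting point for scanning a run's intervals around the time the trouble starts. It assumes the e2e intervals artifact is a JSON file with an "items" list whose entries carry "from"/"to" timestamps plus locator/message fields; the file name, exact schema, and the specific date/time window used here are assumptions, so adjust to the real artifact:

      # Hypothetical sketch: print intervals overlapping the window around
      # 22:50:21 UTC from a locally downloaded e2e intervals JSON artifact.
      # The file path and field names are assumptions; adjust to the real artifact.
      import json
      from datetime import datetime, timezone

      WINDOW_START = datetime(2025, 11, 27, 22, 45, tzinfo=timezone.utc)
      WINDOW_END = datetime(2025, 11, 27, 23, 10, tzinfo=timezone.utc)

      def parse_time(value):
          # Interval timestamps are RFC3339; tolerate a trailing "Z".
          return datetime.fromisoformat(value.replace("Z", "+00:00"))

      def as_text(value):
          # Locator/message may be plain strings or structured objects.
          return value if isinstance(value, str) else json.dumps(value)

      with open("e2e-events.json") as f:  # downloaded from the prow job artifacts
          items = json.load(f).get("items", [])

      for item in items:
          if not item.get("from"):
              continue
          start = parse_time(item["from"])
          end = parse_time(item.get("to") or item["from"])
          if start <= WINDOW_END and end >= WINDOW_START:
              print(start.isoformat(), as_text(item.get("locator", "")),
                    as_text(item.get("message", "")))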

      Filed by: dgoodwin@redhat.com
