OCPBUGS-66420

Loss of APIServer networking, etcd quorum and high CPU causing mass test failures since Nov 27

      (Feel free to update this bug's summary to be more specific.)
      Component Readiness has found a potential regression in the following test:

      [sig-auth][Feature:ProjectAPI]  TestScopedProjectAccess should succeed [apigroup:user.openshift.io][apigroup:project.openshift.io][apigroup:authorization.openshift.io] [Suite:openshift/conformance/parallel]

      Significant regression detected.
      Fisher's exact probability of a regression: 100.00%.
      Test pass rate dropped from 100.00% to 92.23%.

      Sample (being evaluated) Release: 4.21
      Start Time: 2025-11-27T00:00:00Z
      End Time: 2025-12-04T16:00:00Z
      Success Rate: 92.23%
      Successes: 88
      Failures: 8
      Flakes: 7
      Base (historical) Release: 4.17
      Start Time: 2024-09-01T00:00:00Z
      End Time: 2024-10-01T00:00:00Z
      Success Rate: 100.00%
      Successes: 997
      Failures: 0
      Flakes: 0

      View the test details report for additional context.
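
      As a rough illustration of the statistic quoted above (this is not the Component Readiness implementation, and how it treats flakes and converts the p-value into a "probability of a regression" is an assumption), Fisher's exact test can be run directly on the pass/fail counts reported above:

      # Illustrative only: Fisher's exact test on a 2x2 table built from the
      # counts above. How Component Readiness counts flakes and converts the
      # p-value into a "probability of a regression" is assumed here.
      from scipy.stats import fisher_exact

      sample = {"successes": 88, "failures": 8}   # 4.21 sample window
      base = {"successes": 997, "failures": 0}    # 4.17 historical window

      table = [
          [sample["failures"], sample["successes"]],
          [base["failures"], base["successes"]],
      ]

      # One-sided test: are failures over-represented in the sample vs. the base?
      _, p_value = fisher_exact(table, alternative="greater")
      print(f"p-value: {p_value:.6f}")
      print(f"rough probability of a regression: {(1 - p_value) * 100:.2f}%")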

      The exact tests that fail will be somewhat random, but so far we've identified these additional regressions linked to this issue:

      [sig-apps][Feature:DeploymentConfig] deploymentconfigs with revision history limits should never persist more old deployments than acceptable after being observed by the controller [apigroup:apps.openshift.io] [Suite:openshift/conformance/parallel]

      [sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA] OLMv1 operator with webhooks should have a working validating webhook

      The problem appears to clearly begin on Nov 27, when we picked up a new pattern occurring in a decent number of micro upgrade jobs.

      An example of what this looks like: you can see the apiserver jamming up, etcd problems, broad disruption both in-cluster to the affected master (which we associate with OVN being CPU starved) and to the external apiserver (we're guessing because etcd is choking), and a big block of failed tests.

      As a result, we see this surface as lots of regressions, each appearing once one particular test accumulates the minimum number of required failures.

      This also caused a regression with:

      Several others are popping up daily as well, visible in the triage page linked in the comments below. The exact tests that fail vary a little, so we have to get lucky to even see one particular test trigger a regression.

      However, after breaking out the GatewayControllerAPI tests, we found occurrences that hit immediately after those tests complete, which looks like this. This may indicate GatewayControllerAPI is not the cause of what's happening. Another batch of PRs was tried with the GatewayControllerAPI tests fully removed, and the problem again showed up in one of 20 runs. This seems to have made the problem less common, but it's hard to say with limited data.

      The two jobs identified so far that seem to show it most commonly are:

      • periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-upgrade-fips
      • periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-upgrade

      Interestingly, these are both micro upgrade jobs; we cannot yet find it in minor upgrades, which would be odd because the problem occurs well after the upgrade, near the end of conformance testing.

      Analyzing both disruption charts and looking for a pattern in the job pass rate where we suddenly start seeing 20-40 failures, it becomes clear our problem started on Nov 27th. This appears to be the first job run: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-upgrade-fips/1994146713333403648, which comes from payload 4.21.0-0.nightly-2025-11-27-173427.
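
      For anyone repeating that scan, the following is a rough sketch (not existing tooling) that walks recent runs of the suspect job in the public test-platform-results bucket and prints each run's finish time and overall result, which is enough to spot where the mass failures begin. It assumes prow's standard finished.json layout and the unauthenticated GCS JSON listing API; counting the 20-40 individual test failures per run would additionally require parsing the junit artifacts.

      # Rough sketch, not existing tooling: list recent runs of the suspect job
      # from the public test-platform-results bucket and print when each run
      # finished and its overall result. Assumes prow's standard finished.json
      # layout and the unauthenticated GCS JSON listing API; the listing may be
      # paginated, so this only looks at the first page of results.
      import json
      import urllib.request
      from datetime import datetime, timezone

      BUCKET = "test-platform-results"
      JOB = "periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-upgrade-fips"
      LIST_URL = (
          f"https://storage.googleapis.com/storage/v1/b/{BUCKET}/o"
          f"?prefix=logs/{JOB}/&delimiter=/"
      )

      def get_json(url):
          with urllib.request.urlopen(url) as resp:
              return json.load(resp)

      listing = get_json(LIST_URL)
      # Run IDs are fixed-width numeric strings, so a lexical sort is chronological.
      run_ids = sorted(p.rstrip("/").rsplit("/", 1)[-1] for p in listing.get("prefixes", []))

      for run_id in run_ids[-20:]:  # roughly the most recent runs on this page
          url = f"https://storage.googleapis.com/{BUCKET}/logs/{JOB}/{run_id}/finished.json"
          try:
              finished = get_json(url)
          except Exception:
              continue  # run still in progress or artifact missing
          when = datetime.fromtimestamp(finished.get("timestamp", 0), tz=timezone.utc)
          print(run_id, when.isoformat(), finished.get("result"))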

      Unfortunately, due to a long string of red payloads on Nov 27, the changeset in that payload is huge (160 PRs), but we can compare against what was in the prior payload with:

      curl "https://sippy.dptools.openshift.org/api/payloads/diff?toPayload=4.21.0-0.nightly-2025-11-27-173427&fromPayload=4.21.0-0.nightly-2025-11-26-140450" |jq |less
      

      Two interesting PRs show up:

      Unclear if either is related, but both are suspicious, particularly the Kube rebase.

      This is a clear release blocker until we have an explanation. The key things we need right now are:

      • Analysis of what is in the kube rebase PR that might cause this kind of performance problem.
      • Analysis of what exactly the kube apiserver is doing in runs such as this when the problem occurs (22:50:21 in this case is when the trouble starts). The prow job is linked at the top of the intervals chart if you want artifacts; a rough starting point for scanning them is sketched below.
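
      For that second item, here is a hypothetical starting point for scanning a run's intervals around the time the trouble starts. It assumes the e2e intervals artifact is a JSON file with an "items" list whose entries carry "from"/"to" timestamps plus locator/message fields; the file name, exact schema, and the specific date/time window used here are assumptions, so adjust to the real artifact:

      # Hypothetical sketch: print intervals overlapping the window around
      # 22:50:21 UTC from a locally downloaded e2e intervals JSON artifact.
      # The file path and field names are assumptions; adjust to the real artifact.
      import json
      from datetime import datetime, timezone

      WINDOW_START = datetime(2025, 11, 27, 22, 45, tzinfo=timezone.utc)
      WINDOW_END = datetime(2025, 11, 27, 23, 10, tzinfo=timezone.utc)

      def parse_time(value):
          # Interval timestamps are RFC3339; tolerate a trailing "Z".
          return datetime.fromisoformat(value.replace("Z", "+00:00"))

      def as_text(value):
          # Locator/message may be plain strings or structured objects.
          return value if isinstance(value, str) else json.dumps(value)

      with open("e2e-events.json") as f:  # downloaded from the prow job artifacts
          items = json.load(f).get("items", [])

      for item in items:
          if not item.get("from"):
              continue
          start = parse_time(item["from"])
          end = parse_time(item.get("to") or item["from"])
          if start <= WINDOW_END and end >= WINDOW_START:
              print(start.isoformat(), as_text(item.get("locator", "")),
                    as_text(item.get("message", "")))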

      Filed by: dgoodwin@redhat.com
