OpenShift Bugs / OCPBUGS-74973

UDN network segmentation intermittent failures on RHCOS10


      Below is human-written content

      There seems to be an intermittent but regular set of UDN network segmentation failures on 4.22 RHCOS 10. One example of such a job is https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.22-e2e-gcp-ovn-rt-rhcos10-techpreview/2018465846719942656

      More details to follow. Nothing more is known yet, except that the failure happens at random times: the test fails, then succeeds a few times, then fails again.

      Interestingly, this test always seems to fail together with:

      [sig-node][apigroup:config.openshift.io] CPU Partitioning node validation should have correct cpuset and cpushare set in crio containers [Suite:openshift/conformance/parallel]
      

      EDIT 1

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.22-e2e-gcp-ovn-rt-rhcos10-techpreview/2018518738810179584 is a much calmer CI run that exposes the same issue; the run pasted above also had some Node NotReady events during the test run.

      EDIT 2

      For a moment we thought it was caused by, or related to, the CPU partitioning bug in the RT kernel, but there is also a test run that failed on a non-RT kernel: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.22-e2e-azure-ovn-rhcos10-techpreview/2016570724088549376

      Below is the default Sippy content

      (Feel free to update this bug's summary to be more specific.)
      Component Readiness has found a potential regression in the following test:

      [sig-network][OCPFeatureGate:NetworkSegmentation][Feature:UserDefinedPrimaryNetworks] when using openshift ovn-kubernetes created using NetworkAttachmentDefinitions is isolated from the default network with L3 primary UDN [Suite:openshift/conformance/parallel]

      Significant regression detected.
      Fisher's Exact probability of a regression: 99.99%.
      Test pass rate dropped from 100.00% to 90.24%.

      Sample (being evaluated) Release: 4.22
      Start Time: 2026-01-27T00:00:00Z
      End Time: 2026-02-03T04:00:00Z
      Success Rate: 90.24%
      Successes: 36
      Failures: 4
      Flakes: 1
      Base (historical) Release: 4.21
      Start Time: 2026-01-04T00:00:00Z
      End Time: 2026-02-03T04:00:00Z
      Success Rate: 100.00%
      Successes: 85
      Failures: 0
      Flakes: 0
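      The "Fisher's Exact probability" above comes from a one-sided Fisher's exact test on the 2x2 pass/fail table for the two releases. A minimal stdlib sketch using the counts from this report (36/4 vs 85/0, with the flake ignored; Sippy's exact accounting, e.g. of flakes, may differ, so the resulting percentage will not match 99.99% exactly):

```python
from math import comb

def fisher_one_sided(fail_a, pass_a, fail_b, pass_b):
    """One-sided Fisher's exact test: probability of seeing at least
    fail_a failures in sample A under the hypergeometric null that
    failures are spread randomly across both samples."""
    n_a = fail_a + pass_a            # runs in sample A
    total = n_a + fail_b + pass_b    # runs in both samples
    k_total = fail_a + fail_b        # failures in both samples
    p = 0.0
    for k in range(fail_a, min(n_a, k_total) + 1):
        # hypergeometric pmf: P(exactly k of the failures land in sample A)
        p += comb(k_total, k) * comb(total - k_total, n_a - k) / comb(total, n_a)
    return p

# Counts from the report: 4.22 sample (4 failures, 36 successes)
# vs. 4.21 base (0 failures, 85 successes).
p = fisher_one_sided(4, 36, 0, 85)
print(f"p-value: {p:.5f}, 'probability of regression': {1 - p:.2%}")
```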

      View the test details report for additional context.

      Below is an AI-generated description

      ⚠️ AI-Generated Content

      Sippy AI-assisted description; please review details for accuracy.

      Filed from: Test Regression Details

      Test Name

      [sig-network][OCPFeatureGate:NetworkSegmentation][Feature:UserDefinedPrimaryNetworks] when using openshift ovn-kubernetes created using NetworkAttachmentDefinitions is isolated from the default network with L3 primary UDN [Suite:openshift/conformance/parallel]

      Brief Overview

      Significant regression detected. Fisher's Exact probability of a regression: 99.99%. Test pass rate dropped from 100.00% to 90.24%.

      Statistics Section

      Sample (being evaluated)

      Release: 4.22
      Time Period: 2026-01-27T00:00:00Z to 2026-02-03T04:00:00Z
      Success Rate: 90.24%
      Successes: 36
      Failures: 4
      Flakes: 1

      Base (historical)

      Release: 4.21
      Time Period: 2026-01-04T00:00:00Z to 2026-02-03T04:00:00Z
      Success Rate: 100.00%
      Successes: 85
      Failures: 0
      Flakes: 0

      Sample Failure Outputs

      Job Run ID: 2016329483056844800
      Error executing test process: wrapped process failed: exit status 1
      Reporting job state 'failed' with reason 'executing_graph:step_failed:utilizing_lease:executing_test:executing_multi_stage_test'
      
      Job Run ID: 2016823463020335104
      Error executing test process: wrapped process failed: exit status 124
      Process did not finish before 10m0s timeout
      Reporting job state 'failed' with reason 'executing_graph:step_failed:utilizing_lease:executing_test:executing_multi_stage_test'
      
      Job Run ID: 2017360593417146368
      Error executing test process: wrapped process failed: exit status 1
      Failed to create pod sandbox: rpc error: code = Unknown desc = failed to find runtime handler test-handler from runtime list
      Readiness probe failed: Get "http://10.131.0.25:81/": dial tcp 10.131.0.25:81: connect: connection refused
      Reporting job state 'failed' with reason 'executing_graph:step_failed:utilizing_lease:executing_test:executing_multi_stage_test'
      
      Job Run ID: 2018465846719942656
      Error executing test process: wrapped process failed: exit status 1
      Readiness probe failed: Get "http://10.131.0.17:81/": dial tcp 10.131.0.17:81: connect: connection refused
      Reporting job state 'failed' with reason 'executing_graph:step_failed:utilizing_lease:executing_test:executing_multi_stage_test'
      

      Links to Relevant Jobs

      Patterns and Insights

      The test has regressed significantly, with its success rate dropping from 100% in the base release (4.21) to 90.24% in the sample release (4.22). The failures appear to be consistent across multiple job runs, all stemming from the `periodic-ci-openshift-release-master-nightly-4.22-e2e-gcp-ovn-rt-rhcos10-techpreview` job.

      Common error messages include "Error executing test process" and "wrapped process failed: exit status 1" or "exit status 124". Several failures also show "Readiness probe failed: ... connect: connection refused", indicating potential networking or pod readiness issues. One instance also reported a timeout and a failure to find a runtime handler for a pod sandbox. This suggests a systemic issue affecting test execution or the underlying environment, possibly related to network connectivity, container runtime, or resource availability during test execution. The presence of a flake in the sample period further indicates instability.
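      The "Readiness probe failed ... connect: connection refused" lines mean an httpGet-style probe found nothing listening on the pod IP and port yet. A minimal sketch of what such a probe checks (an illustration only, not the kubelet's implementation; the host and port are placeholders, not values from the job logs):

```python
import http.client

def http_ready(host: str, port: int, path: str = "/", timeout: float = 2.0) -> bool:
    """Return True if GET http://host:port/path answers with a 2xx/3xx
    status, roughly what an httpGet readiness probe considers 'ready'."""
    try:
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request("GET", path)
        ok = 200 <= conn.getresponse().status < 400
        conn.close()
        return ok
    except OSError:
        # Covers "connect: connection refused" (ConnectionRefusedError)
        # and timeouts: the server pod is not serving on that port yet.
        return False
```

In the failing runs, the probe's GET against port 81 hit this refused-connection path, i.e. the test server pod on the user-defined network never came up to serve.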

      Filed by: mkowalsk@redhat.com

              rh-ee-arsen Arkadeep Sen (Aurko)
              mkowalsk@redhat.com Mat Kowalski