OpenShift Bugs / OCPBUGS-11591

Mass sig-network test failures on GCP OVN


Details

    • Critical
    • Yes
    • SDN Sprint 234, SDN Sprint 235
    • 2
    • Approved
    • False

      This problem is blocking all payload promotion: OpenShift is unable to accept payloads, effectively blocking shipping code within the organization, and we will also miss our 4.Next sprint candidate payload this time around due to this and multiple other issues.


    Description

      Payload promotion is currently blocked and we are unable to accept release payloads.

      These failures are primarily surfacing on one of our most common jobs: periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-ovn-upgrade where we see batches of 20-30 network tests failing together.

      The problem occurs after upgrade, during the conformance testing phase. The best visualization of this is the second spyglass intervals chart on a prow job, which can also be used to see what else was going on in the cluster at that time.

      Some examples:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-ovn-upgrade/1646082164120358912 (direct link to its spyglass chart with the failures)

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-ovn-upgrade/1646082163382161408 (direct link to its spyglass chart with the failures)

      But essentially you can choose any run from the list in Sippy; this happens nearly 100% of the time.

       

      This also appears to be hitting the nightly payloads again, beginning on April 4th: periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade

      First observed time was 2023-04-04 14:36:57+00

      First payload: https://sippy.dptools.openshift.org/sippy-ng/release/4.14/tags/4.14.0-0.ci-2023-04-04-143533/pull_requests. We've tried reverting 4 of these changes, but nothing helped, and it is difficult to see how any of these changes could be causing this.

      Jamo also discovered this is surfacing in the less commonly run plain e2e job: periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-ovn. This means the problem does not actually require an upgrade to surface, which makes sense given that we are seeing it in post-upgrade conformance. The bad news is that whatever this is would likely impact customer clusters on a regular basis (though it has thankfully been caught before release).

      The most commonly affected test seems to be:

      test=[sig-network] pods should successfully create sandboxes by adding pod to network

      Sippy shows the degradation of this test quite dramatically here, where we can see, for example, an 85% pass rate last week versus 3% this week, over hundreds of runs for gcp, amd64, ovn, upgrade, upgrade-micro, ha.
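
      For anyone poking at a live cluster that hits this, here is a minimal sketch of pulling out the pod-sandbox-creation failure events this style of test keys off of. It assumes client-go and a standard kubeconfig; the string matching is illustrative and is not the exact logic origin uses for this synthetic test.

      // sandbox_events.go: list events that look like pod sandbox creation
      // failures on the "adding pod to network" path. Illustrative only;
      // origin's matching for this synthetic test may differ.
      package main

      import (
          "context"
          "fmt"
          "strings"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Build a client from the default kubeconfig loading rules.
          cfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
              clientcmd.NewDefaultClientConfigLoadingRules(),
              &clientcmd.ConfigOverrides{},
          ).ClientConfig()
          if err != nil {
              panic(err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)

          // Events age out quickly, so this only catches recent failures.
          events, err := client.CoreV1().Events("").List(context.TODO(), metav1.ListOptions{})
          if err != nil {
              panic(err)
          }
          for _, ev := range events.Items {
              msg := strings.ToLower(ev.Message)
              if strings.Contains(msg, "failed to create pod sandbox") &&
                  strings.Contains(msg, "adding pod to network") {
                  fmt.Printf("%s %s/%s: %s\n",
                      ev.LastTimestamp, ev.Namespace, ev.InvolvedObject.Name, ev.Message)
              }
          }
      }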

      Other common tests (but the precise set jumps around a bit):

      test=[sig-autoscaling] [Feature:HPA] Horizontal pod autoscaling (scale resource: CPU) CustomResourceDefinition Should scale with a CRD targetRef [Suite:openshift/conformance/parallel] [Suite:k8s]

      test=[sig-network] NetworkPolicyLegacy [LinuxOnly] NetworkPolicy between server and client should enforce policy to allow traffic only from a pod in a different namespace based on PodSelector and NamespaceSelector [Feature:NetworkPolicy] [Skipped:Network/OpenShiftSDN/Multitenant] [Suite:openshift/conformance/parallel] [Suite:k8s]

      test=[sig-network] NetworkPolicyLegacy [LinuxOnly] NetworkPolicy between server and client should enforce multiple egress policies with egress allow-all policy taking precedence [Feature:NetworkPolicy] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]
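
      To make the NetworkPolicy cases above concrete, the PodSelector-and-NamespaceSelector test applies a policy roughly shaped like the one below. This is a hand-written approximation, not the upstream fixture; the names and label keys (allow-ns-b-client-a, pod-name, ns-name) are made up for illustration.

      // Rough approximation of the "allow traffic only from a pod in a
      // different namespace based on PodSelector and NamespaceSelector" case:
      // ingress to the server pod is allowed only from pods labeled
      // pod-name=client-a running in namespaces labeled ns-name=ns-b.
      package main

      import (
          "fmt"

          networkingv1 "k8s.io/api/networking/v1"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "sigs.k8s.io/yaml"
      )

      func main() {
          policy := &networkingv1.NetworkPolicy{
              ObjectMeta: metav1.ObjectMeta{Name: "allow-ns-b-client-a"},
              Spec: networkingv1.NetworkPolicySpec{
                  // Select the server pods the test connects to.
                  PodSelector: metav1.LabelSelector{
                      MatchLabels: map[string]string{"pod-name": "server"},
                  },
                  Ingress: []networkingv1.NetworkPolicyIngressRule{{
                      From: []networkingv1.NetworkPolicyPeer{{
                          // Both selectors in a single peer are ANDed: the client
                          // must match the pod selector AND sit in a matching namespace.
                          NamespaceSelector: &metav1.LabelSelector{
                              MatchLabels: map[string]string{"ns-name": "ns-b"},
                          },
                          PodSelector: &metav1.LabelSelector{
                              MatchLabels: map[string]string{"pod-name": "client-a"},
                          },
                      }},
                  }},
              },
          }
          out, _ := yaml.Marshal(policy)
          fmt.Println(string(out))
      }

      The detail the test name refers to is that the pod selector and namespace selector live in the same peer, so they are ANDed rather than treated as two separate allow rules.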

      The problem appears to have begun around April 4 according to the Sippy graph for the job linked at the start of this description; that is the point at which the pod sandbox test's pass rate starts dropping dramatically.

      Sippy recently gained the ability to show us what merged between two timestamps, so here is the list of PRs that merged between April 3 and April 5.

      Most of the failures have a cause similar to:

      Pod did not finish as expected.: timed out while waiting for pod e2e-network-policy-5418/client-can-connect-81-5sbsb to be completed
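
      That message is the generic e2e "wait for the client pod to complete" timeout. Conceptually the framework is doing something like the poll below before giving up; this is a simplified sketch with a made-up interval, not the actual origin or kubernetes test framework code.

      // waitForPodCompleted polls until the pod reaches Succeeded (or errors out
      // on Failed), mirroring the kind of wait whose expiry produces the
      // "timed out while waiting for pod ... to be completed" message.
      package podwait

      import (
          "context"
          "fmt"
          "time"

          corev1 "k8s.io/api/core/v1"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/util/wait"
          "k8s.io/client-go/kubernetes"
      )

      func waitForPodCompleted(client kubernetes.Interface, namespace, name string, timeout time.Duration) error {
          return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
              pod, err := client.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
              if err != nil {
                  return false, err
              }
              switch pod.Status.Phase {
              case corev1.PodSucceeded:
                  // The client pod ran its check and exited cleanly.
                  return true, nil
              case corev1.PodFailed:
                  return false, fmt.Errorf("pod %s/%s failed", namespace, name)
              default:
                  // Still Pending or Running; if the sandbox never comes up the pod
                  // stays Pending and this poll eventually times out.
                  return false, nil
              }
          })
      }

      If the CNI "adding pod to network" step is what is failing, the client pods presumably never leave Pending, which would explain batches of otherwise unrelated network tests all failing with this same timeout.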
      

      The problem is often accompanied by some disruption during the conformance phase (not to be confused with the disruption regression that happened during upgrade and was reverted yesterday; I will try to avoid examples that show both, for clarity).

            People

              rhn-engineering-dgoodwin Devan Goodwin
              rh-ee-fbabcock Forrest Babcock
              Anurag Saxena