OpenShift Bugs / OCPBUGS-11591

Mass sig-network test failures on GCP OVN


Details

    • Critical
    • Yes
    • SDN Sprint 234, SDN Sprint 235
    • 2
    • Approved
    • False

      This problem is blocking all payload promotion: OpenShift is unable to accept payloads, effectively blocking shipping code within the organization, and we will also miss our 4.Next sprint candidate payload this time around due to this and multiple other issues.


    Description

      Payload promotion is currently blocked and we are unable to accept release payloads.

      These failures are primarily surfacing on one of our most common jobs: periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-ovn-upgrade where we see batches of 20-30 network tests failing together.

      The problem occurs after upgrade, during the conformance testing phase. The best visualization of this is the second spyglass intervals chart on a prow job, which can also be used to see what else was going on in the cluster at that time.

      Some examples:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-ovn-upgrade/1646082164120358912 (direct link to its spyglass chart with the failures)

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-ovn-upgrade/1646082163382161408 (direct link to its spyglass chart with the failures)

      But essentially you can choose any run from the list in Sippy; this happens nearly 100% of the time.

       

      This also appears to be hitting the nightly payloads again, beginning on April 4th: periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade

      First observed time was 2023-04-04 14:36:57+00

      First payload: https://sippy.dptools.openshift.org/sippy-ng/release/4.14/tags/4.14.0-0.ci-2023-04-04-143533/pull_requests. We've tried reverting 4 of these changes, but nothing helped, and it is difficult to see how any of these changes could be causing this.

      Jamo also discovered this is surfacing in the less commonly run plain e2e job: periodic-ci-openshift-release-master-ci-4.14-e2e-gcp-ovn. This means the problem does not actually require an upgrade to surface, which makes sense given that we are seeing it in post-upgrade conformance. The bad news is that whatever this is would likely impact customer clusters on a regular basis (though it has thankfully been caught before release).

      The most commonly affected test seems to be:

      test=[sig-network] pods should successfully create sandboxes by adding pod to network

      Sippy shows the degradation of this test quite dramatically here, where we can see, for example, an 85% pass rate last week versus 3% this week, over hundreds of runs for gcp, amd64, ovn, upgrade, upgrade-micro, ha.
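
      For anyone poking at a live cluster that hits this, here is a minimal sketch of pulling out the pod-sandbox-creation failure events this style of test keys off of. It assumes client-go and a standard kubeconfig; the string matching is illustrative and is not the exact logic origin uses for this synthetic test.

      // sandbox_events.go: list events that look like pod sandbox creation
      // failures on the "adding pod to network" path. Illustrative only;
      // origin's matching for this synthetic test may differ.
      package main

      import (
          "context"
          "fmt"
          "strings"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          // Build a client from the default kubeconfig loading rules.
          cfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
              clientcmd.NewDefaultClientConfigLoadingRules(),
              &clientcmd.ConfigOverrides{},
          ).ClientConfig()
          if err != nil {
              panic(err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)

          // Events age out quickly, so this only catches recent failures.
          events, err := client.CoreV1().Events("").List(context.TODO(), metav1.ListOptions{})
          if err != nil {
              panic(err)
          }
          for _, ev := range events.Items {
              msg := strings.ToLower(ev.Message)
              if strings.Contains(msg, "failed to create pod sandbox") &&
                  strings.Contains(msg, "adding pod to network") {
                  fmt.Printf("%s %s/%s: %s\n",
                      ev.LastTimestamp, ev.Namespace, ev.InvolvedObject.Name, ev.Message)
              }
          }
      }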

      Other common tests (but the precise set jumps around a bit):

      test=[sig-autoscaling] [Feature:HPA] Horizontal pod autoscaling (scale resource: CPU) CustomResourceDefinition Should scale with a CRD targetRef [Suite:openshift/conformance/parallel] [Suite:k8s]

      test=[sig-network] NetworkPolicyLegacy [LinuxOnly] NetworkPolicy between server and client should enforce policy to allow traffic only from a pod in a different namespace based on PodSelector and NamespaceSelector [Feature:NetworkPolicy] [Skipped:Network/OpenShiftSDN/Multitenant] [Suite:openshift/conformance/parallel] [Suite:k8s]

      test=[sig-network] NetworkPolicyLegacy [LinuxOnly] NetworkPolicy between server and client should enforce multiple egress policies with egress allow-all policy taking precedence [Feature:NetworkPolicy] [Skipped:Network/OpenShiftSDN/Multitenant] [Skipped:Network/OpenShiftSDN] [Suite:openshift/conformance/parallel] [Suite:k8s]
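
      To make the NetworkPolicy cases above concrete, the PodSelector-and-NamespaceSelector test applies a policy roughly shaped like the one below. This is a hand-written approximation, not the upstream fixture; the names and label keys (allow-ns-b-client-a, pod-name, ns-name) are made up for illustration.

      // Rough approximation of the "allow traffic only from a pod in a
      // different namespace based on PodSelector and NamespaceSelector" case:
      // ingress to the server pod is allowed only from pods labeled
      // pod-name=client-a running in namespaces labeled ns-name=ns-b.
      package main

      import (
          "fmt"

          networkingv1 "k8s.io/api/networking/v1"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "sigs.k8s.io/yaml"
      )

      func main() {
          policy := &networkingv1.NetworkPolicy{
              ObjectMeta: metav1.ObjectMeta{Name: "allow-ns-b-client-a"},
              Spec: networkingv1.NetworkPolicySpec{
                  // Select the server pods the test connects to.
                  PodSelector: metav1.LabelSelector{
                      MatchLabels: map[string]string{"pod-name": "server"},
                  },
                  Ingress: []networkingv1.NetworkPolicyIngressRule{{
                      From: []networkingv1.NetworkPolicyPeer{{
                          // Both selectors in a single peer are ANDed: the client
                          // must match the pod selector AND sit in a matching namespace.
                          NamespaceSelector: &metav1.LabelSelector{
                              MatchLabels: map[string]string{"ns-name": "ns-b"},
                          },
                          PodSelector: &metav1.LabelSelector{
                              MatchLabels: map[string]string{"pod-name": "client-a"},
                          },
                      }},
                  }},
              },
          }
          out, _ := yaml.Marshal(policy)
          fmt.Println(string(out))
      }

      The detail the test name refers to is that the pod selector and namespace selector live in the same peer, so they are ANDed rather than treated as two separate allow rules.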

      The problem appears to have begun around April 4 according to the Sippy graph for the job linked at the start of this description; that is the point at which the pod sandbox test's pass rate starts dropping dramatically.

      Sippy recently gained the ability to show us what merged between two timestamps, so here is the list of PRs that merged between April 3 and April 5.

      Most of the failures have a cause similar to:

      Pod did not finish as expected.: timed out while waiting for pod e2e-network-policy-5418/client-can-connect-81-5sbsb to be completed
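
      That message is the generic e2e "wait for the client pod to complete" timeout. Conceptually the framework is doing something like the poll below before giving up; this is a simplified sketch with a made-up interval, not the actual origin or kubernetes test framework code.

      // waitForPodCompleted polls until the pod reaches Succeeded (or errors out
      // on Failed), mirroring the kind of wait whose expiry produces the
      // "timed out while waiting for pod ... to be completed" message.
      package podwait

      import (
          "context"
          "fmt"
          "time"

          corev1 "k8s.io/api/core/v1"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/util/wait"
          "k8s.io/client-go/kubernetes"
      )

      func waitForPodCompleted(client kubernetes.Interface, namespace, name string, timeout time.Duration) error {
          return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
              pod, err := client.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
              if err != nil {
                  return false, err
              }
              switch pod.Status.Phase {
              case corev1.PodSucceeded:
                  // The client pod ran its check and exited cleanly.
                  return true, nil
              case corev1.PodFailed:
                  return false, fmt.Errorf("pod %s/%s failed", namespace, name)
              default:
                  // Still Pending or Running; if the sandbox never comes up the pod
                  // stays Pending and this poll eventually times out.
                  return false, nil
              }
          })
      }

      If the CNI "adding pod to network" step is what is failing, the client pods presumably never leave Pending, which would explain batches of otherwise unrelated network tests all failing with this same timeout.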
      

      The problem is often accompanied by some disruption during the conformance phase (not to be confused with the disruption regression that happened during upgrade and was reverted yesterday; I will try to avoid examples that show both, for clarity).

            People

              rhn-engineering-dgoodwin Devan Goodwin
              rh-ee-fbabcock Forrest Babcock
              Anurag Saxena