
OCP 4.14 - OVS upcall handling issue at load scale from multiple different client tuples

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • openvswitch3.3

      Given a system that needs to handle high volumes of requests from thousands of different IP sources with OVS, 

      When the system receives a sustained load of requests exceeding 75000 per second with varied source IPs, 

      Then OVS should handle the traffic without generating excessive upcalls that would cause CPU overload and failure of health checks for the pods.

    • rhel-sst-network-fastdatapath
    • ssg_networking
    • Important

      //ISSUE:

      Customer has a product application running on two well-provisioned worker nodes. The application is exposed using MetalLB, and a load test is being driven against it with UDP traffic at a rate exceeding 75k qps.

      They are aiming for a rate of 120k requests per second and observe failures well before reaching that rate. The backends of the application (and multiple other applications on the nodes being hammered by the request rate) start to fail their local kubelet --> pod liveness probes.

      Health checks begin to fail and pods are marked NotReady. This removes the target application from MetalLB as a valid backend and causes flapping behavior across all pods on the hosts.


      This is a continuation of https://issues.redhat.com/browse/FDP-579, which was addressed/resolved with an increased conntrack bucket size plus a recommended upgrade to 4.16. However, we are revisiting this problem because it has been observed that our previous problem replicator was using only a few localized IP addresses/tuples in its requests to the nodes. When testing in production, we immediately overwhelm OVS on the nodes, and the primary difference is that we have thousands of unique IPs/tuples establishing these requests, compared to the 5 or so localized IPs from the same load test we were running internally.
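
      To rule the FDP-579 issue back in or out on an affected node, a quick check of the conntrack tunables is useful. The sketch below is illustrative only (not part of the case data): it reads the standard nf_conntrack proc tunables and reports table occupancy and average hash-chain length; the 80% threshold is an arbitrary example.

{code:python}
#!/usr/bin/env python3
# Illustrative sketch: verify the conntrack bucket override is in effect and
# that the conntrack table itself is not the bottleneck. The proc paths are
# the standard nf_conntrack tunables; the 80% threshold is an example only.

def read_int(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())

buckets = read_int("/proc/sys/net/netfilter/nf_conntrack_buckets")
maximum = read_int("/proc/sys/net/netfilter/nf_conntrack_max")
count = read_int("/proc/sys/net/netfilter/nf_conntrack_count")

print(f"buckets={buckets} max={maximum} count={count}")
# Long average chains indicate the bucket count is too small for the load.
print(f"average chain length: {count / buckets:.2f}")
if count > 0.8 * maximum:
    print("warning: conntrack table above 80% of nf_conntrack_max")
{code}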


      We modified the local replicator to have several thousand IP addresses and immediately replicated the behavior. 
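
      For reference, a minimal sketch of the kind of replicator change described above (this is not the customer's actual tool): it hand-builds IP/UDP headers on a raw socket so every packet carries a different spoofed source address, presenting OVS with a constant stream of new tuples. The destination VIP, port, and source prefix are placeholders, and pure Python will not reach 75k qps on its own; the point is only the tuple variety.

{code:python}
#!/usr/bin/env python3
# Hedged sketch of a many-source-IP UDP replicator (placeholders throughout).
# Requires root for the raw socket; IP_HDRINCL is implied by IPPROTO_RAW.
import random
import socket
import struct

DST_IP = "192.0.2.10"        # placeholder: MetalLB service VIP
DST_PORT = 5000              # placeholder: exposed UDP port
SRC_PREFIX = "10.{}.{}.{}"   # placeholder: pool of spoofed client addresses
PAYLOAD = b"x" * 64

def checksum(data: bytes) -> int:
    """Standard 16-bit ones'-complement checksum for the IPv4 header."""
    if len(data) % 2:
        data += b"\x00"
    s = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    s = (s >> 16) + (s & 0xFFFF)
    s += s >> 16
    return ~s & 0xFFFF

def build_packet(src_ip: str) -> bytes:
    src = socket.inet_aton(src_ip)
    dst = socket.inet_aton(DST_IP)
    udp_len = 8 + len(PAYLOAD)
    # UDP checksum left at 0 (optional for IPv4) to keep the sketch short.
    udp = struct.pack("!HHHH", 40000, DST_PORT, udp_len, 0) + PAYLOAD
    total_len = 20 + udp_len
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, total_len, 0, 0, 64,
                     socket.IPPROTO_UDP, 0, src, dst)
    ip = ip[:10] + struct.pack("!H", checksum(ip)) + ip[12:]
    return ip + udp

def main() -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_RAW)
    while True:
        # Each packet uses a different source IP, so OVS sees a new tuple and
        # (unless the megaflow wildcards the source) must take a new upcall.
        src_ip = SRC_PREFIX.format(random.randint(0, 255),
                                   random.randint(0, 255),
                                   random.randint(1, 254))
        sock.sendto(build_packet(src_ip), (DST_IP, 0))

if __name__ == "__main__":
    main()
{code}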


      Local nodes still have a conntrack bucket override in place to ensure we aren't hitting the problem outlined previously. The symptoms appear the same, but the networking team believes it is probable that we are making a new upcall for each new tuple and OVS can't keep up.
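
      One way to sanity-check that hypothesis on an affected node is to watch the kernel datapath upcall counters. The sketch below is an assumption about how such a check could be scripted, not an agreed procedure: it samples `ovs-dpctl show` twice and prints the per-second missed (upcall) and lost (dropped-upcall) rates, then dumps `ovs-appctl upcall/show` for handler/revalidator status.

{code:python}
#!/usr/bin/env python3
# Hedged sketch: sample kernel datapath stats twice and report upcall pressure.
# "missed" lookups become upcalls to ovs-vswitchd; "lost" means the upcall
# queue overflowed and the datapath dropped packets.
import re
import subprocess
import time

def dp_stats() -> dict:
    out = subprocess.run(["ovs-dpctl", "show"], capture_output=True,
                         text=True, check=True).stdout
    m = re.search(r"lookups:\s*hit:(\d+)\s+missed:(\d+)\s+lost:(\d+)", out)
    assert m, "unexpected ovs-dpctl show output"
    return {"hit": int(m.group(1)), "missed": int(m.group(2)),
            "lost": int(m.group(3))}

def main() -> None:
    interval = 10
    a = dp_stats()
    time.sleep(interval)
    b = dp_stats()
    print("upcalls/s:      ", (b["missed"] - a["missed"]) / interval)
    print("lost upcalls/s: ", (b["lost"] - a["lost"]) / interval)
    # Handler/revalidator view for the same window.
    print(subprocess.run(["ovs-appctl", "upcall/show"],
                         capture_output=True, text=True).stdout)

if __name__ == "__main__":
    main()
{code}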


      Need assistance reviewing available data samples. 


      OCP 4.14.14, OVN-Kubernetes.

      Traffic flow, details, and case information are in the first comment (internal).

              rh-ee-mpattric Mike Pattrick
              rhn-support-wrussell Will Russell
              Aaron Conole