Fast Datapath Product / FDP-781

[backport 4.15] OCP 4.14.14 - UDP traffic flood leading to TCP packet loss/application health probe failures (lost SYNs from kubelet to pod)


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • openvswitch3.3
    • rhel-sst-network-fastdatapath
    • ssg_networking
    • Severity: Important

      Description of problem:

      The customer has an application running on two well-provisioned worker nodes. The application is exposed via MetalLB, and a load test is sending UDP traffic at a rate exceeding 30k requests per second.
      
      They are aiming for a rate of 120k requests per second and observe failures well before reaching that rate.
      
      The backends of the application (and multiple other applications on the nodes under this request load) start to fail their local kubelet --> pod liveness probes. Health checks fail, pods are marked NotReady, and the target application's pods are removed from MetalLB as valid backends, causing flapping.
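
      One way to see this from the node side is to repeat a kubelet-style probe by hand while the UDP load is applied. The sketch below is illustrative only: it assumes a TCP liveness probe, and the pod IP and port are placeholders rather than values from this case; timeouts here line up with the probe failures described above.

      #!/usr/bin/env python3
      # Re-run a kubelet-style TCP probe from the node's host network namespace
      # and log how long each attempt takes. POD_IP and PROBE_PORT are
      # placeholders for illustration, not values taken from this case.
      import socket
      import time

      POD_IP = "10.128.2.15"   # placeholder: pod IP of a failing backend
      PROBE_PORT = 8080        # placeholder: the container's probe port
      TIMEOUT_S = 1.0          # kubelet's default probe timeout is 1 second

      while True:
          start = time.monotonic()
          try:
              with socket.create_connection((POD_IP, PROBE_PORT), timeout=TIMEOUT_S):
                  print(f"ok   {time.monotonic() - start:.3f}s")
          except OSError as exc:
              # A timeout means the SYN (or SYN-ACK) never made it through,
              # which is what marks the pod NotReady under load.
              print(f"FAIL {time.monotonic() - start:.3f}s: {exc}")
          time.sleep(1)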
      
      Other pods on the node, including DNS and Elasticsearch (unrelated namespaces/services that simply happen to be hosted on the same node), are also impacted.
      
      We see this issue any time we fire sufficient load at the service.    
      
      Traffic flow:
      client --> [external network/firewall/LBs] --> BGP speaker pod [1] DNAT --> NODE (host) [br-ex] via [bond1.2039] --> br-int --> [OVN logical gateways/switches] --> eth0 (pod)

      Version-Release number of selected component (if applicable):

          OCP 4.14.14
      
      bare-metal cluster

      How reproducible:

      every time    

      Steps to Reproduce:

          1. Send a high rate of UDP requests to the target application's BGP-advertised (MetalLB) address (a load-generator sketch follows after these steps).
          2. Observe that once traffic reaches roughly 30k requests per second, pods/services on the node become unstable; the traffic flood prevents pods from answering their own health probes.
          3. Observe SYNs dropped on requests to the backends: a TCP capture on the node shows SYN retries, while a capture inside the pod shows the SYN never arriving, i.e. it is dropped in transit through the OVN gateways (see the capture-analysis sketch below).
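
      For step 1, a minimal load-generator sketch follows. The target IP/port and rate are placeholders, and this is not the customer's actual tool; it only illustrates the traffic shape: many small UDP requests spread across many source ports, each of which becomes its own conntrack entry. A single Python process will likely not sustain 30k req/s by itself; run several copies or use a dedicated load tool.

      #!/usr/bin/env python3
      # Send small UDP datagrams at a target rate, spread over many source
      # ports. TARGET and RATE_PER_SEC are placeholders for illustration.
      import socket
      import time

      TARGET = ("192.0.2.10", 5000)   # placeholder: MetalLB-advertised VIP and UDP port
      RATE_PER_SEC = 30000            # placeholder target rate
      PAYLOAD = b"x" * 64             # small request payload
      NUM_SOCKETS = 512               # each socket = a distinct source port = a distinct conntrack entry

      socks = [socket.socket(socket.AF_INET, socket.SOCK_DGRAM) for _ in range(NUM_SOCKETS)]
      interval = 1.0 / RATE_PER_SEC
      next_send = time.monotonic()
      i = 0
      while True:
          socks[i % NUM_SOCKETS].sendto(PAYLOAD, TARGET)
          i += 1
          next_send += interval
          delay = next_send - time.monotonic()
          if delay > 0:
              time.sleep(delay)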
          
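      For step 3, the captures can be checked for SYN retransmissions with something like the sketch below. It assumes the third-party scapy package, and "node.pcap" is a placeholder for a tcpdump capture taken on the node side (br-ex/bond). The same SYN appearing more than once means the handshake is being retried.

      #!/usr/bin/env python3
      # Count retransmitted SYNs in a capture: the same (src, dst, sport, dport, seq)
      # SYN seen more than once means the client is retrying the handshake.
      from collections import Counter
      from scapy.all import rdpcap, IP, TCP

      syns = Counter()
      for pkt in rdpcap("node.pcap"):      # placeholder capture filename
          if IP in pkt and TCP in pkt and pkt[TCP].flags == "S":
              key = (pkt[IP].src, pkt[IP].dst, pkt[TCP].sport, pkt[TCP].dport, pkt[TCP].seq)
              syns[key] += 1

      retried = {k: n for k, n in syns.items() if n > 1}
      print(f"{len(retried)} of {len(syns)} SYNs were retransmitted")
      for (src, dst, sport, dport, _seq), n in sorted(retried.items(), key=lambda kv: -kv[1])[:10]:
          print(f"  {src}:{sport} -> {dst}:{dport}  sent {n} times")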

      Actual results:

      Traffic is effectively limited to ~30k requests/s; above that, node networking degrades and the cluster cannot handle the desired workload.

      Expected results:

          Traffic should not be limited at this rate, and UDP traffic should not interfere with TCP handshakes (or, if this is an expected limit, it needs documentation and explicit limit validation).

      Additional info:

         See the first comment below (private note) for full details and the collaboration request.

       

       

      //UPDATE: This appears to be due to the conntrack hash bucket size limitation: new connections are dropped once a hash chain grows too long (the chaintoolong counter increments).

      Detailed in KCS: https://access.redhat.com/solutions/7073555 

      pending upstream kernel testing
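
      A rough way to check for this condition on an affected node is sketched below. The sysctl paths are standard; the chaintoolong column only exists on kernels that enforce the per-bucket chain-length limit, so the sketch looks counters up by name rather than by position. Rising insert_failed/drop/chaintoolong totals while the UDP load is applied would be consistent with the bucket-size limitation described above.

      #!/usr/bin/env python3
      # Dump conntrack sizing sysctls and summed per-CPU counters from
      # /proc/net/stat/nf_conntrack (a header row of column names, then one
      # hex-encoded row per CPU).

      def read_int(path):
          with open(path) as f:
              return int(f.read().strip())

      count = read_int("/proc/sys/net/netfilter/nf_conntrack_count")
      maximum = read_int("/proc/sys/net/netfilter/nf_conntrack_max")
      buckets = read_int("/proc/sys/net/netfilter/nf_conntrack_buckets")
      print(f"count={count} max={maximum} buckets={buckets} "
            f"avg_chain_len={count / buckets:.1f}")

      with open("/proc/net/stat/nf_conntrack") as f:
          header = f.readline().split()
          totals = [0] * len(header)
          for line in f:
              for i, field in enumerate(line.split()):
                  totals[i] += int(field, 16)

      stats = dict(zip(header, totals))
      # Column names vary by kernel version; chaintoolong is only present on
      # kernels that enforce the per-bucket chain-length limit.
      for name in ("insert_failed", "drop", "early_drop", "chaintoolong"):
          if name in stats:
              print(f"{name}={stats[name]}")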

              People: Aaron Conole (aconole@redhat.com), Will Russell (rhn-support-wrussell), Jaime Caamaño Ruiz, Anurag Saxena