Bug
Resolution: Unresolved
Major
rhel-sst-network-fastdatapath
ssg_networking
Important
//ISSUE:
The customer has a product application running on two well-provisioned worker nodes. The application is exposed using MetalLB, and a load test sends it UDP traffic at a rate exceeding 75k qps.
They are aiming for 120k requests per second but observe failures before reaching that rate. The backends of the application (and of multiple other applications on the nodes being hammered by the request rate) start to fail their local kubelet --> pod liveness probes.
Health checks begin to fail, and pods are marked as not ready. This removes the target application from MetalLB as a valid backend and causes flapping on all pods on the hosts.
This is a continuation of https://issues.redhat.com/browse/FDP-579, which was addressed/resolved with an increased conntrack bucket size plus a recommended upgrade to 4.16. We are revisiting the problem because our previous replicator used only a few localized IP addresses/tuples in its requests to the nodes. When testing in production, we immediately overwhelm OVS on the nodes; the primary difference is that THOUSANDS of unique IPs/tuples are establishing these requests, compared with the 5 or so localized IPs in the load test we were running internally.
We modified the local replicator to use several thousand IP addresses and immediately reproduced the behavior.
Local nodes still have the conntrack bucket override in place to ensure we aren't hitting the problem outlined previously. The symptoms appear to be the same, but the networking team believes we are probably making a new upcall request for each new tuple and OVS can't keep up.
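To illustrate why the number of unique tuples could matter more than the raw packet rate, here is a toy Python model of the hypothesis above. It is only a sketch under stated assumptions: it treats the flow cache as an unbounded exact-match table in which the first packet of every new tuple is a miss, which is a simplification and not how the real OVS datapath (with wildcarded flows) necessarily behaves. The addresses, ports, and function names are made up for the example.

```python
def synth_packets(n_packets, n_sources):
    """Yield n_packets synthetic UDP 5-tuples, cycling round-robin
    through n_sources distinct source addresses.

    All values are hypothetical: sources come from 10.0.0.0/8, and the
    destination 192.0.2.10:5000 stands in for the load-balanced service.
    """
    for i in range(n_packets):
        src = i % n_sources
        src_ip = f"10.{(src >> 16) & 0xFF}.{(src >> 8) & 0xFF}.{src & 0xFF}"
        yield (src_ip, 40000, "192.0.2.10", 5000, "udp")


def count_flow_misses(packets):
    """Count packets whose 5-tuple has not been seen before.

    Assumption: every first packet of a new tuple misses the cache
    (and would need an upcall); every later packet of that tuple hits.
    """
    seen = set()
    misses = 0
    for tup in packets:
        if tup not in seen:
            seen.add(tup)
            misses += 1
    return misses


if __name__ == "__main__":
    # Same packet count, different numbers of unique sources.
    for n_sources in (5, 5000):
        misses = count_flow_misses(synth_packets(75_000, n_sources))
        print(f"{n_sources} sources -> {misses} first-packet misses")
```

Under this model, the same 75,000 packets produce 5 misses with 5 sources and 5,000 misses with 5,000 sources, which matches the shape of the difference seen between the original replicator and production traffic. Whether OVS really performs one upcall per tuple in this environment is still the open question for this ticket.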
Need assistance reviewing available data samples.
4.14.14, OVN-Kubernetes.
Traffic flow, details, and case information are in the first comment (internal).
- relates to
FDP-992 northd should not create IPv6 prefix delegation logical flows if none configured
- Code Review