Feature Request
Resolution: Unresolved
Critical
4.16
Product / Portfolio Work
1. Proposed title of this feature request
ovn-kubernetes throttling to avoid hammering ovn-controller when the back pressure is too high
2. What is the nature and description of the request?
I have a customer (IHAC) facing an issue related to ovn-controller slowness at scale.
We are reproducing this in a test environment (so far ~5000 pods, 30094 netpols, and 756 egressfirewalls spread across 5 worker nodes).
Such a number of pods is not used in production, but the issue is the same; this test setup is only used to trigger the issue more easily.
In this particular case ovn-controller is seen consuming a lot of CPU, perf top shows a lot of time spent in malloc, and a random stack trace shows:
gdb.3-#0 0x00007fcd76d70610 in _int_malloc () from /lib64/libc.so.6
gdb.3-#1 0x00007fcd76d71809 in malloc () from /lib64/libc.so.6
gdb.3-#2 0x000055bc69d568cb in add_matches_to_flow_table ()
gdb.3:#3 0x000055bc69d57dbc in consider_logical_flow..lto_priv ()
gdb.3-#4 0x000055bc69d586ed in lflow_handle_changed_ref ()
gdb.3-#5 0x000055bc69dc07ca in objdep_mgr_handle_change ()
gdb.3-#6 0x000055bc69d8bc1e in lflow_output_port_groups_handler.lto_priv ()
and the ovn-controller logs show:
2025-05-15T06:22:06.703Z|62692|inc_proc_eng|INFO|node: logical_flow_output, handler for input port_groups took 10152ms
2025-05-15T06:22:16.737Z|62709|inc_proc_eng|INFO|node: logical_flow_output, handler for input port_groups took 9965ms
2025-05-15T06:22:26.913Z|62733|inc_proc_eng|INFO|node: logical_flow_output, handler for input port_groups took 10086ms
It may be related to the 2-minute timeout that the CNI gives for a pod's network to be set up. If setup takes more than 2 minutes, the network installation never completes; it just cycles in a loop. In that case ovnkube-controller keeps removing and recreating the pod's network configuration, and also keeps polling OVS to check the state of the pods:
- ps auxt | grep -c "/usr/bin/ovs-vsctl --timeout=30 --if-exists get Interface"
30
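For illustration, here is a rough sketch in Go of the retry loop described above. The helper names and the readiness check are hypothetical, and this is not the actual ovnkube-controller code; it only mirrors the described behaviour (poll OVS until a 2-minute deadline, then tear down and recreate the pod network and start over):

package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// interfaceReady polls OVS once, shelling out the same way the ps output
// above shows; the exact field that ovn-kubernetes checks is elided in the
// original command, so this readiness test is only a hypothetical stand-in.
func interfaceReady(ctx context.Context, iface string) bool {
	out, err := exec.CommandContext(ctx, "ovs-vsctl", "--timeout=30",
		"--if-exists", "get", "Interface", iface, "external_ids").Output()
	return err == nil && len(out) > 0 // hypothetical readiness check
}

func main() {
	iface := "example-pod-iface"
	for attempt := 1; ; attempt++ {
		// The CNI gives each attempt roughly 2 minutes, as described above.
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
		for !interfaceReady(ctx, iface) && ctx.Err() == nil {
			time.Sleep(5 * time.Second) // keep polling until the deadline
		}
		succeeded := ctx.Err() == nil
		cancel()
		if succeeded {
			break
		}
		// On timeout the pod network is removed and recreated and the whole
		// wait starts again, so the polling and flow churn never stop while
		// ovn-controller is slow.
		fmt.Println("attempt", attempt, "timed out; recreating pod network")
	}
}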
Instead of trying to fix the ovn-controller slowness itself, it may be possible to improve ovn-kubernetes so that CNI updates are not pushed while ovn-controller is under too much pressure. This RFE asks that ovn-kubernetes stop hammering ovn-controller when the back pressure is too high.
Some kind of QoS/throttling where only a limited number of pods can be installed at the same time.
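A minimal sketch in Go of the kind of QoS/throttling being asked for; the names (setupPodNetwork, throttledPodAdd, maxConcurrentPodSetups) are hypothetical and not part of the existing ovn-kubernetes code. A bounded slot pool caps how many pod network installations are in flight, so a flood of pod events is absorbed here instead of being pushed straight to ovn-controller when it is already behind:

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

const maxConcurrentPodSetups = 20 // illustrative cap only, would need tuning

// podSetupSlots is a counting semaphore implemented as a buffered channel.
var podSetupSlots = make(chan struct{}, maxConcurrentPodSetups)

// setupPodNetwork stands in for the real CNI ADD handling.
func setupPodNetwork(ctx context.Context, pod string) error {
	fmt.Println("installing network for", pod)
	time.Sleep(100 * time.Millisecond) // pretend work
	return nil
}

// throttledPodAdd waits for a free slot before pushing any update, so back
// pressure (slow installations holding their slots) automatically slows the
// rate of new CNI updates instead of hammering ovn-controller.
func throttledPodAdd(ctx context.Context, pod string) error {
	select {
	case podSetupSlots <- struct{}{}: // acquire a slot
	case <-ctx.Done():
		return ctx.Err()
	}
	defer func() { <-podSetupSlots }() // release the slot
	return setupPodNetwork(ctx, pod)
}

func main() {
	var wg sync.WaitGroup
	ctx := context.Background()
	for i := 0; i < 100; i++ {
		pod := fmt.Sprintf("pod-%d", i)
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = throttledPodAdd(ctx, pod)
		}()
	}
	wg.Wait()
}

The cap itself could be static or driven by some back-pressure signal from ovn-controller (for example, how long recent installations took); the mechanism for measuring that pressure is left open in this request.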
3. Why does the customer need this? (List the business requirements here)
In order to move to OCP and scale according to business requirements, my customer deploys large workloads on OCP. This limit is reached for a couple of those workloads and blocks the migration. Splitting the applications across multiple active/active clusters is not possible due to either the constrained migration timeframe or technical limitations of the workloads.
4. List any affected packages or components.
ovn-kubernetes, OVN