-
Bug
-
Resolution: Done
-
Major
-
4.19
-
Quality / Stability / Reliability
-
False
-
Rejected
Description of problem:
During scale testing of the BGP export scenario:
- High CPU usage of the ovnkube-controller container
- High CPU usage on both master and worker nodes
- CPU usage remains high, with spikes every minute, even after 3 hours
I am scale testing the BGP router export scenario on an OCP 4.19 bare-metal cluster with 24 worker nodes (no infra nodes), using 4.19.0-ec.3 with an OVNK BGP image built from the PR build 4.19,openshift/ovn-kubernetes#2239 on 03/25/2025.
The test creates 72 CUDNs, with 1 namespace per CUDN and 1 pod per namespace. I waited 30 minutes after creating all of these resources before creating the route advertisements (RAs).
The test then creates 72 RAs, each advertising exactly one unique CUDN, i.e. RA:CUDN is 1:1.
Average CPU usage of the ovnkube-controller container is 788% and max CPU usage is 2139%.
We also see very high CPU usage on each master node, i.e. avg 2935% and max 4625%.
CPU usage on each worker node is avg 2835% and max 4472%.
Component | Avg CPU usage % | Max CPU usage % |
---|---|---|
ovnkube-controller container | 788 | 2139 |
frr controller container | 1390 | 1813 |
Master node | 2909 | 4625 |
Worker node | 2765 | 4522 |
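To make the table easier to read against node capacity, the percentages can be converted to CPU-core equivalents. This is only a rough sketch: it assumes the dashboards report CPU with 100% equal to one fully busy core, which the ticket does not state, and the helper name `percent_to_cores` is hypothetical.

```python
# Figures copied from the table above: (average %, maximum %).
CPU_PERCENT = {
    "ovnkube-controller container": (788, 2139),
    "frr controller container": (1390, 1813),
    "Master node": (2909, 4625),
    "Worker node": (2765, 4522),
}


def percent_to_cores(percent):
    """Convert a CPU percentage to core equivalents.

    Assumption (not confirmed by the ticket): 100% == one fully busy core.
    """
    return percent / 100


# Core equivalents per component: (average cores, maximum cores).
CORES = {
    name: (percent_to_cores(avg), percent_to_cores(peak))
    for name, (avg, peak) in CPU_PERCENT.items()
}

for name, (avg, peak) in CORES.items():
    print(f"{name}: avg {avg:.2f} cores, max {peak:.2f} cores")
```

Under that assumption, the ovnkube-controller container averages roughly 8 cores and peaks at roughly 21 cores.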
Another observation is that CPU usage remains high, with spikes every minute, even after 3 hours (no resources were created or deleted during those 3 hours). CPU usage returned to normal only after deleting the RAs.
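One way to confirm the one-minute spike period from exported metric samples is to measure the gap between successive threshold crossings. The sketch below is illustrative only: the function name `spike_intervals`, the threshold, and the sample series are all made up for the example and are not data from this test run.

```python
def spike_intervals(samples, threshold):
    """Return gaps in seconds between the starts of successive spikes.

    samples: iterable of (timestamp_seconds, cpu_percent) in time order.
    A spike starts when the value rises from below to at/above threshold.
    """
    starts = []
    above = False
    for ts, value in samples:
        if value >= threshold and not above:
            starts.append(ts)
        above = value >= threshold
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


# Synthetic series (NOT measured data): one sample every 15 s for 5 minutes,
# with a spike on every whole minute.
synthetic = [(t, 2000.0 if t % 60 == 0 else 700.0) for t in range(0, 300, 15)]

gaps = spike_intervals(synthetic, threshold=1500.0)
print(gaps)
```

If the gaps computed from the real samples cluster tightly around 60 seconds, that points to a periodic resync or timer rather than event-driven load.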
Test results are available at https://docs.google.com/spreadsheets/d/1WLuTpcrTwFBUcZ-XF2wppOJL9M13_V_HYh2i23UpQUo/edit?usp=sharing
Grafana screenshots at https://storage.scalelab.redhat.com/anilvenkata/bgp/ra72export/
I can provide the live environment to the engineer for troubleshooting.