- Bug
- Resolution: Done
- Normal
- None
- 4.16
- No
- False
Description of problem:
During EIP failover scale testing with 24000 EIPs in the environment and 200 EIPs per node, we observed an incorrect increment for ovnkube_clustermanager_egress_ips_rebalance_total. The metric was incremented by only 64 even though the restarted node was hosting 200 EIPs.

The test creates 24000 namespaces. In each namespace it creates 1 EIP object with 1 EIP address and 1 Deployment with 1 pod replica. In total the test creates 24000 namespaces, 24000 EIPs, and 24000 pods.

We observed that each node has 200 EIPs and 200 pods:

[root@f20-h01-000-r640 kube-burner-ocp]# oc get egressip -oyaml | grep node: | sort | uniq -c
    200 node: f20-h11-000-r640.rdu2.scalelab.redhat.com
    200 node: f20-h14-000-r640.rdu2.scalelab.redhat.com
    200 node: f20-h17-000-r640.rdu2.scalelab.redhat.com
    200 node: f20-h18-000-r640.rdu2.scalelab.redhat.com
    200 node: f20-h19-000-r640.rdu2.scalelab.redhat.com
    200 node: f20-h21-000-r640.rdu2.scalelab.redhat.com
    200 node: f20-h22-000-r640.rdu2.scalelab.redhat.com
    200 node: f20-h25-000-r640.rdu2.scalelab.redhat.com
    200 node: f20-h27-000-r640.rdu2.scalelab.redhat.com
    200 node: f20-h29-000-r640.rdu2.scalelab.redhat.com
    200 node: f21-h01-000-r640.rdu2.scalelab.redhat.com
    200 node: f21-h03-000-r640.rdu2.scalelab.redhat.com
    200 node: f21-h05-000-r640.rdu2.scalelab.redhat.com
    200 node: f21-h06-000-r640.rdu2.scalelab.redhat.com
    200 node: f21-h10-000-r640.rdu2.scalelab.redhat.com
    200 node: f21-h11-000-r640.rdu2.scalelab.redhat.com
    200 node: f21-h13-000-r640.rdu2.scalelab.redhat.com
    200 node: f21-h14-000-r640.rdu2.scalelab.redhat.com
    200 node: f21-h15-000-r640.rdu2.scalelab.redhat.com
    200 node: f21-h17-000-r640.rdu2.scalelab.redhat.com
    200 node: f21-h18-000-r640.rdu2.scalelab.redhat.com
    200 node: f21-h19-000-r640.rdu2.scalelab.redhat.com
    200 node: f21-h21-000-r640.rdu2.scalelab.redhat.com
    200 node: f30-h06-000-r640.rdu2.scalelab.redhat.com
    200 node: f30-h07-000-r640.rdu2.scalelab.redhat.com
    200 node: f30-h09-000-r640.rdu2.scalelab.redhat.com
    200 node: f30-h10-000-r640.rdu2.scalelab.redhat.com
    200 node: f30-h11-000-r640.rdu2.scalelab.redhat.com
    200 node: f30-h13-000-r640.rdu2.scalelab.redhat.com
    200 node: f30-h14-000-r640.rdu2.scalelab.redhat.com
    200 node: f30-h15-000-r640.rdu2.scalelab.redhat.com
    200 node: f30-h18-000-r640.rdu2.scalelab.redhat.com
    200 node: f30-h19-000-r640.rdu2.scalelab.redhat.com
    200 node: f30-h21-000-r640.rdu2.scalelab.redhat.com
    200 node: f31-h01-000-r640.rdu2.scalelab.redhat.com
    200 node: f31-h02-000-r640.rdu2.scalelab.redhat.com
    200 node: f31-h03-000-r640.rdu2.scalelab.redhat.com
    200 node: f31-h05-000-r640.rdu2.scalelab.redhat.com
    200 node: f31-h06-000-r640.rdu2.scalelab.redhat.com
    200 node: f31-h07-000-r640.rdu2.scalelab.redhat.com
    200 node: f31-h09-000-r640.rdu2.scalelab.redhat.com
    200 node: f31-h10-000-r640.rdu2.scalelab.redhat.com
    200 node: f31-h11-000-r640.rdu2.scalelab.redhat.com
    200 node: f31-h13-000-r640.rdu2.scalelab.redhat.com
    200 node: f31-h14-000-r640.rdu2.scalelab.redhat.com
    200 node: f31-h15-000-r640.rdu2.scalelab.redhat.com
    200 node: f31-h17-000-r640.rdu2.scalelab.redhat.com
    200 node: f31-h18-000-r640.rdu2.scalelab.redhat.com
    200 node: f31-h19-000-r640.rdu2.scalelab.redhat.com
    200 node: f32-h01-000-r640.rdu2.scalelab.redhat.com
    200 node: f32-h02-000-r640.rdu2.scalelab.redhat.com
    200 node: f32-h03-000-r640.rdu2.scalelab.redhat.com
    200 node: f32-h05-000-r640.rdu2.scalelab.redhat.com
    200 node: f32-h06-000-r640.rdu2.scalelab.redhat.com
    200 node: f32-h07-000-r640.rdu2.scalelab.redhat.com
    200 node: f32-h09-000-r640.rdu2.scalelab.redhat.com
    200 node: f32-h10-000-r640.rdu2.scalelab.redhat.com
    200 node: f32-h11-000-r640.rdu2.scalelab.redhat.com
    200 node: f32-h13-000-r640.rdu2.scalelab.redhat.com
    200 node: f32-h14-000-r640.rdu2.scalelab.redhat.com
    200 node: f32-h15-000-r640.rdu2.scalelab.redhat.com
    200 node: f32-h17-000-r640.rdu2.scalelab.redhat.com
    200 node: f32-h18-000-r640.rdu2.scalelab.redhat.com
    200 node: f32-h19-000-r640.rdu2.scalelab.redhat.com
    200 node: f32-h21-000-r640.rdu2.scalelab.redhat.com
    200 node: f33-h01-000-r640.rdu2.scalelab.redhat.com
    200 node: f33-h02-000-r640.rdu2.scalelab.redhat.com
    200 node: f33-h05-000-r640.rdu2.scalelab.redhat.com
    200 node: f33-h06-000-r640.rdu2.scalelab.redhat.com
    200 node: f33-h07-000-r640.rdu2.scalelab.redhat.com
    200 node: f33-h09-000-r640.rdu2.scalelab.redhat.com
    200 node: f33-h10-000-r640.rdu2.scalelab.redhat.com
    200 node: f33-h11-000-r640.rdu2.scalelab.redhat.com
    200 node: f33-h13-000-r640.rdu2.scalelab.redhat.com
    200 node: f33-h14-000-r640.rdu2.scalelab.redhat.com
    200 node: f33-h15-000-r640.rdu2.scalelab.redhat.com
    200 node: f33-h17-000-r640.rdu2.scalelab.redhat.com
    200 node: f33-h18-000-r640.rdu2.scalelab.redhat.com
    200 node: f33-h19-000-r640.rdu2.scalelab.redhat.com
    200 node: f33-h21-000-r640.rdu2.scalelab.redhat.com
    200 node: f34-h01-000-r640.rdu2.scalelab.redhat.com
    200 node: f34-h02-000-r640.rdu2.scalelab.redhat.com
    200 node: f34-h03-000-r640.rdu2.scalelab.redhat.com
    200 node: f34-h05-000-r640.rdu2.scalelab.redhat.com
    200 node: f34-h06-000-r640.rdu2.scalelab.redhat.com
    200 node: f34-h07-000-r640.rdu2.scalelab.redhat.com
    200 node: f34-h09-000-r640.rdu2.scalelab.redhat.com
    200 node: f34-h10-000-r640.rdu2.scalelab.redhat.com
    200 node: f34-h11-000-r640.rdu2.scalelab.redhat.com
    200 node: f34-h13-000-r640.rdu2.scalelab.redhat.com
    200 node: f34-h14-000-r640.rdu2.scalelab.redhat.com
    200 node: f34-h15-000-r640.rdu2.scalelab.redhat.com
    200 node: f34-h17-000-r640.rdu2.scalelab.redhat.com
    200 node: f34-h18-000-r640.rdu2.scalelab.redhat.com
    200 node: f34-h19-000-r640.rdu2.scalelab.redhat.com
    200 node: f34-h21-000-r640.rdu2.scalelab.redhat.com
    200 node: f35-h01-000-r640.rdu2.scalelab.redhat.com
    200 node: f35-h02-000-r640.rdu2.scalelab.redhat.com
    200 node: f35-h03-000-r640.rdu2.scalelab.redhat.com
    200 node: f35-h05-000-r640.rdu2.scalelab.redhat.com
    200 node: f35-h06-000-r640.rdu2.scalelab.redhat.com
    200 node: f35-h07-000-r640.rdu2.scalelab.redhat.com
    200 node: f35-h09-000-r640.rdu2.scalelab.redhat.com
    200 node: f35-h10-000-r640.rdu2.scalelab.redhat.com
    200 node: f35-h11-000-r640.rdu2.scalelab.redhat.com
    200 node: f35-h13-000-r640.rdu2.scalelab.redhat.com
    200 node: f35-h14-000-r640.rdu2.scalelab.redhat.com
    200 node: f35-h15-000-r640.rdu2.scalelab.redhat.com
    200 node: f35-h17-000-r640.rdu2.scalelab.redhat.com
    200 node: f35-h19-000-r640.rdu2.scalelab.redhat.com
    200 node: f35-h21-000-r640.rdu2.scalelab.redhat.com
    200 node: f36-h01-000-r640.rdu2.scalelab.redhat.com
    200 node: f36-h02-000-r640.rdu2.scalelab.redhat.com
    200 node: f36-h03-000-r640.rdu2.scalelab.redhat.com
    200 node: f36-h05-000-r640.rdu2.scalelab.redhat.com
    200 node: f36-h06-000-r640.rdu2.scalelab.redhat.com
    200 node: f36-h07-000-r640.rdu2.scalelab.redhat.com
    200 node: f36-h09-000-r640.rdu2.scalelab.redhat.com
    200 node: f36-h10-000-r640.rdu2.scalelab.redhat.com
    200 node: f36-h11-000-r640.rdu2.scalelab.redhat.com

We rebooted node f36-h09-000-r640.rdu2.scalelab.redhat.com, which had 200 EIPs:

[root@f20-h01-000-r640 kube-burner-ocp]# oc get egressip -oyaml | grep "node: f36-h09-000-r640.rdu2.scalelab.redhat.com" | sort | uniq -c
    200 node: f36-h09-000-r640.rdu2.scalelab.redhat.com

However, the metric increased by only 64 instead of 200. After the reboot, the node has no EIPs assigned, as expected:

[root@f20-h01-000-r640 kube-burner-ocp]# oc get egressip -oyaml | grep "node: f36-h09-000-r640.rdu2.scalelab.redhat.com" | sort | uniq -c
[root@f20-h01-000-r640 kube-burner-ocp]#

More details about the testing and observations are at https://docs.google.com/document/d/17NGv6pR-3VFVD5hFzpcdYt5hRZChOSIzmm0BPc-Rdx8/edit?usp=sharing

Environment details
- OCP deployed on bare metal nodes
- 120 workers, 2 infra, 3 masters
- All nodes have the same configuration: CPUs: 80, Memory: 384G, NIC bandwidth: 25 Gb/s
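The per-node count used above can also be taken with a small helper that reads the `oc get egressip -oyaml` output from stdin, which makes it convenient to record the count for one node before and after a reboot. This is only a minimal sketch: the function name `count_eips_on_node` is hypothetical, and it assumes every assignment appears as a `node: <hostname>` line in the YAML, as in the output in this report.

```shell
#!/bin/sh
# count_eips_on_node NODE   (hypothetical helper, not part of any tool)
# Reads EgressIP YAML from stdin and prints the number of lines of the
# form "node: NODE". An exact hostname match is used, so one node name
# that is a prefix of another is not counted by mistake.
count_eips_on_node() {
    awk -v node="$1" '
        # Strip indentation and an optional YAML list marker ("- ").
        { sub(/^[ \t]*-?[ \t]*/, "") }
        $1 == "node:" && $2 == node { n++ }
        END { print n + 0 }   # n + 0 prints 0 when nothing matched
    '
}
```

Usage, assuming `oc` is logged in to the cluster: `oc get egressip -oyaml | count_eips_on_node f36-h09-000-r640.rdu2.scalelab.redhat.com`.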
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
The Perf and OVN teams developed a custom workload for EIP scale testing. The OCP deployment is currently available for debugging this issue.
Actual results:
The metric's increment is less than the number of EIPs assigned to the node that restarted.
Expected results:
The metric's increment should be equal to the number of EIPs assigned to the node that restarted.
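The expected-versus-actual comparison can be expressed as a simple check on the counter delta. The sketch below is an illustration only: `check_rebalance_delta` is a hypothetical helper, and it assumes the caller has already read the integer value of ovnkube_clustermanager_egress_ips_rebalance_total before and after the reboot by whatever means is available. It does not query the metric itself.

```shell
#!/bin/sh
# check_rebalance_delta BEFORE AFTER EXPECTED   (hypothetical helper)
# BEFORE/AFTER: integer counter readings taken around the node reboot.
# EXPECTED:     number of EIPs that were assigned to the rebooted node.
# Prints the result and returns non-zero when the delta does not match.
check_rebalance_delta() {
    before=$1 after=$2 expected=$3
    delta=$((after - before))
    if [ "$delta" -eq "$expected" ]; then
        echo "OK: delta=$delta"
    else
        echo "MISMATCH: delta=$delta expected=$expected"
        return 1
    fi
}
```

With the numbers from this report, an increment of 64 against an expected 200 would be reported as a mismatch.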
Additional info: