- Bug
- Resolution: Done
- Normal
- 4.11.z
- None
- Important
- None
- Rejected
- False
Description of problem:
Running a large-scale setup with 117 nodes, 1756 persistent VMs, and 3993 container-disk VMs, we hit a major OVN scaling problem triggered by the 5749 NodePorts (one per VM, used for the ssh service). As a result, the RAM and CPU consumption of the ovnkube-master pods peaked and they were unable to handle the cluster:

[root@e24-h10-000-r640 ~]# oc adm -n openshift-ovn-kubernetes top pod -l app=ovnkube-master
NAME                   CPU(cores)   MEMORY(bytes)
ovnkube-master-bsrr6   273m         45821Mi
ovnkube-master-cg68p   570m         49832Mi
ovnkube-master-qfpnx   774m         58104Mi

All 3 instances were crashing:

NAME                   READY   STATUS             RESTARTS         AGE
ovnkube-master-7nnz5   5/6     CrashLoopBackOff   66 (55s ago)     5h26m
ovnkube-master-7sd2l   5/6     CrashLoopBackOff   66 (2m31s ago)   5h27m
ovnkube-master-jx4c5   5/6     CrashLoopBackOff   62 (3m52s ago)   5h7m

Deleting all NodePorts did not resolve the CrashLoopBackOff. This issue should be documented in the release notes since it is extremely destructive -- CNV side tracked in BZ2128785. Note that Dan Williams is aware of this issue. Unfortunately, the cluster could not be recovered; it was redeployed, and the reservation has now ended.
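For reference, the bulk NodePort deletion attempted above can be scripted with a `jq` filter like the following. This is a sketch assuming `kubectl` and `jq` are available; the sample Service names are illustrative, not taken from the original cluster.

```shell
# Select every NodePort Service as "<namespace> <name>" pairs.
# Against a live cluster, the pairs would feed a delete loop, e.g.:
#   kubectl get svc -A -o json | jq -r "$FILTER" \
#     | xargs -n2 sh -c 'kubectl delete svc -n "$1" "$2"' _
FILTER='.items[] | select(.spec.type=="NodePort") | "\(.metadata.namespace) \(.metadata.name)"'

# Demonstrate the filter on a small inline sample (no cluster needed):
SAMPLE='{"items":[
  {"metadata":{"namespace":"default","name":"vm-1-ssh"},"spec":{"type":"NodePort"}},
  {"metadata":{"namespace":"default","name":"web"},"spec":{"type":"ClusterIP"}}
]}'
echo "$SAMPLE" | jq -r "$FILTER"
```

The filter only selects Services whose `spec.type` is exactly `NodePort`, so ClusterIP and LoadBalancer Services are left untouched.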
Version-Release number of selected component (if applicable):
OCP 4.11.4
OpenShift Virtualization 4.11.0
OVN internal version: 22.06.1-20.23.0-63.4
How reproducible:
Easily
Steps to Reproduce:
1. Create thousands of NodePort Services on a cluster using OVN-Kubernetes; we do not believe this is specific to VMs. VM NodePort usage is documented here: https://docs.openshift.com/container-platform/4.11/virt/virtual_machines/virt-accessing-vm-consoles.html#virt-accessing-vmi-ssh_virt-accessing-vm-consoles
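The step above can be sketched with a small generator for the per-VM SSH NodePort Services. The service naming scheme and the `vm.kubevirt.io/name` label selector here are assumptions for illustration, not the exact manifests used on the original cluster.

```shell
# Sketch: emit a NodePort Service manifest exposing SSH (port 22) for one VM.
# Name and selector are illustrative assumptions.
gen_ssh_nodeport() {
  vm="$1"
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: ${vm}-ssh
spec:
  type: NodePort
  selector:
    vm.kubevirt.io/name: ${vm}
  ports:
  - name: ssh
    protocol: TCP
    port: 22
    targetPort: 22
EOF
}

# To reproduce at scale, thousands of such Services would be applied, e.g.:
#   for i in $(seq 1 5000); do gen_ssh_nodeport "vm-$i" | kubectl apply -f -; done
gen_ssh_nodeport vm-1
```

Each Service of `type: NodePort` allocates a node port on every node, which is what multiplies the OVN load balancer configuration at this scale.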
Actual results:
OVN crashed and could not be recovered
Expected results:
OVN can scale with multiple service types, including NodePorts
Additional info:
A few log snippets:

2022-09-12T07:04:57.385Z|00086|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2022-09-12T07:05:06.556Z|00087|memory|INFO|peak resident set size grew 74% in last 306.0 seconds, from 39784 kB to 69224 kB
2022-09-19T14:11:00.477Z|336607|timeval|WARN|Unreasonably long 95070ms poll interval (75605ms user, 18997ms system)
2022-09-19T14:14:21.719Z|102841|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2022-09-19T14:14:34.097Z|102842|inc_proc_eng|INFO|node: northd, recompute (forced) took 12376ms
2022-09-19T14:14:45.346Z|102843|inc_proc_eng|INFO|node: lflow, recompute (forced) took 11249ms
2022-09-19T14:15:36.182Z|102844|timeval|WARN|Unreasonably long 74461ms poll interval (71012ms user, 508ms system)
2022-09-20T12:28:30.709Z|04446|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 19 (/var/run/ovn/ovn-northd.9.ctl<->) at lib/stream-fd.c:157 (95% CPU usage)
2022-09-20T12:28:30.724Z|04447|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 18 (/var/run/ovn/ovn-northd.9.ctl<->) at lib/stream-fd.c:157 (95% CPU usage)
2022-09-20T12:28:30.740Z|04448|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 17 (/var/run/ovn/ovn-northd.9.ctl<->) at lib/stream-fd.c:157 (95% CPU usage)
2022-09-20T14:10:50.113Z|05024|jsonrpc|WARN|unix#8456: send error: Broken pipe
2022-09-20T14:10:50.158Z|05025|stream_ssl|WARN|SSL_accept: system error (Success)
2022-09-20T14:10:50.160Z|05026|stream_ssl|WARN|SSL_accept: system error (Success)
2022-09-20T14:10:50.160Z|05027|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (0.0.0.0:9642<->) at ../lib/stream-ssl.c:972 (92% CPU usage)
2022-09-20T14:10:50.160Z|05028|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 41 (/var/run/ovn/ovnsb_db.ctl<->) at ../lib/stream-fd.c:157 (92% CPU usage)
2022-09-20T14:10:50.160Z|05029|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (192.x.x.x:9644<->) at ../lib/stream-ssl.c:972 (92% CPU usage)
2022-09-20T14:10:50.160Z|05030|reconnect|WARN|ssl:192.x.x.x:48290: connection dropped (Protocol error)
2022-09-20T14:10:50.161Z|05031|stream_ssl|WARN|SSL_accept: system error (Success)
2022-09-20T14:10:50.161Z|05032|reconnect|WARN|ssl:192.x.x.x:34622: connection dropped (Protocol error)
2022-09-20T14:10:50.162Z|05033|stream_ssl|WARN|SSL_accept: system error (Success)