  1. OpenShift Bugs
  2. OCPBUGS-1940

OVN at scale | 1000s of NodePorts for VMs cause OVN to crash and cannot recover


    • Important
    • None
    • Rejected
    • False

      Description of problem:

      While running a large-scale setup of 117 nodes with 1756 persistent VMs and 3993 container-disk VMs, we hit a major OVN scaling problem triggered by the 5749 NodePorts (one per VM, used for the ssh service).
      
      As a result, RAM and CPU consumption of the ovnkube-master pods peaked and they were unable to handle the cluster:
      
      [root@e24-h10-000-r640 ~]# oc adm -n openshift-ovn-kubernetes top pod -l app=ovnkube-master
      NAME                   CPU(cores)   MEMORY(bytes)   
      ovnkube-master-bsrr6   273m         45821Mi         
      ovnkube-master-cg68p   570m         49832Mi         
      ovnkube-master-qfpnx   774m         58104Mi  
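
To put the numbers above in perspective, here is a minimal sketch that parses this kind of `top pod` table and totals the memory. The sample text is copied from this report; the function names and the 48 GiB threshold are hypothetical, chosen only for illustration.

```python
import re

# Sample copied from the report; in practice this text would come from
# `oc adm -n openshift-ovn-kubernetes top pod -l app=ovnkube-master`.
TOP_OUTPUT = """\
NAME                   CPU(cores)   MEMORY(bytes)
ovnkube-master-bsrr6   273m         45821Mi
ovnkube-master-cg68p   570m         49832Mi
ovnkube-master-qfpnx   774m         58104Mi
"""

# Matches rows such as "pod-name   273m   45821Mi" (millicores and MiB only).
ROW = re.compile(r"^(\S+)\s+(\d+)m\s+(\d+)Mi\s*$")


def parse_top(text):
    """Return a list of (pod, cpu_millicores, memory_mib) tuples."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        match = ROW.match(line)
        if match:
            rows.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return rows


def pods_over(rows, limit_mib):
    """Names of pods whose memory exceeds limit_mib (a hypothetical threshold)."""
    return [pod for pod, _cpu, mem in rows if mem > limit_mib]


rows = parse_top(TOP_OUTPUT)
total_gib = sum(mem for _pod, _cpu, mem in rows) / 1024
print(f"total memory: {total_gib:.1f} GiB")  # about 150 GiB across the three pods
print(pods_over(rows, 48 * 1024))
```

The regular expression only handles the `m` and `Mi` units shown in this report; other units would need extra cases.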
      
      All 3 instances were crashing:
      
      NAME                   READY   STATUS             RESTARTS         AGE
      ovnkube-master-7nnz5   5/6     CrashLoopBackOff   66 (55s ago)     5h26m
      ovnkube-master-7sd2l   5/6     CrashLoopBackOff   66 (2m31s ago)   5h27m
      ovnkube-master-jx4c5   5/6     CrashLoopBackOff   62 (3m52s ago)   5h7m
      
      Deleting all NodePorts did not resolve the crash loop.
      
      This issue should be documented in the release notes since it is extremely destructive. The CNV side is tracked in BZ2128785.
      
      Note that Dan Williams is aware of this issue.
      Unfortunately, recovery was not possible: the cluster was redeployed, and the reservation has now ended.

      Version-Release number of selected component (if applicable):

      OCP 4.11.4 
      OpenShift Virtualization 4.11.0
      OVN internal version is : [22.06.1-20.23.0-63.4]

      How reproducible:

      Easily

      Steps to Reproduce:

      1. Create 1000s of NodePorts on a cluster using OVN-K. We do not believe this is specific to VMs.
      
      VM NodePort usage is documented here: https://docs.openshift.com/container-platform/4.11/virt/virtual_machines/virt-accessing-vm-consoles.html#virt-accessing-vmi-ssh_virt-accessing-vm-consoles 
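
A minimal sketch of how such a reproduction might be scripted, assuming one NodePort Service per workload exposing SSH on port 22. The namespace, Service names, and selector labels below are hypothetical and not taken from the affected cluster. Note that the Kubernetes default NodePort range (30000-32767) holds only 2768 ports, so reaching the counts in this report would presumably require a wider range; how the reported cluster was configured is not stated.

```python
import json


def nodeport_service(index, namespace="scale-test"):
    """Build one NodePort Service manifest (names and labels are hypothetical)."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"vm-ssh-{index:05d}", "namespace": namespace},
        "spec": {
            "type": "NodePort",
            # Hypothetical selector; a real VM Service would select the
            # label carried by the VM's launcher pod.
            "selector": {"app": f"vm-{index:05d}"},
            # nodePort is left unset so the API server allocates one.
            "ports": [
                {"name": "ssh", "protocol": "TCP", "port": 22, "targetPort": 22}
            ],
        },
    }


def service_list(count):
    """Wrap `count` Service manifests in a single v1 List object."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [nodeport_service(i) for i in range(count)],
    }


manifest = service_list(5749)  # same count as in the report
print(len(manifest["items"]), "services generated")
print(json.dumps(manifest["items"][0], indent=2))
```

Writing `manifest` to a file with `json.dump` and passing it to `oc apply -f` would be one way to create the Services in bulk on a disposable test cluster.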

      Actual results:

      OVN crashed and could not be recovered

      Expected results:

      OVN can scale with multiple service types, including NodePorts

      Additional info:

      A few log snippets:
      
      2022-09-12T07:04:57.385Z|00086|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
      2022-09-12T07:05:06.556Z|00087|memory|INFO|peak resident set size grew 74% in last 306.0 seconds, from 39784 kB to 69224 kB
      
      2022-09-19T14:11:00.477Z|336607|timeval|WARN|Unreasonably long 95070ms poll interval (75605ms user, 18997ms system)
      
      2022-09-19T14:14:21.719Z|102841|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.                                                                                                                      
      2022-09-19T14:14:34.097Z|102842|inc_proc_eng|INFO|node: northd, recompute (forced) took 12376ms                                                                                                                                        
      2022-09-19T14:14:45.346Z|102843|inc_proc_eng|INFO|node: lflow, recompute (forced) took 11249ms                                                                                                                                         
      2022-09-19T14:15:36.182Z|102844|timeval|WARN|Unreasonably long 74461ms poll interval (71012ms user, 508ms system)  
      
      2022-09-20T12:28:30.709Z|04446|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 19 (/var/run/ovn/ovn-northd.9.ctl<->) at lib/stream-fd.c:157 (95% CPU usage)
      2022-09-20T12:28:30.724Z|04447|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 18 (/var/run/ovn/ovn-northd.9.ctl<->) at lib/stream-fd.c:157 (95% CPU usage)
      2022-09-20T12:28:30.740Z|04448|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 17 (/var/run/ovn/ovn-northd.9.ctl<->) at lib/stream-fd.c:157 (95% CPU usage)
      
      2022-09-20T14:10:50.113Z|05024|jsonrpc|WARN|unix#8456: send error: Broken pipe
      2022-09-20T14:10:50.158Z|05025|stream_ssl|WARN|SSL_accept: system error (Success)
      2022-09-20T14:10:50.160Z|05026|stream_ssl|WARN|SSL_accept: system error (Success)
      2022-09-20T14:10:50.160Z|05027|poll_loop|INFO|wakeup due to [POLLIN] on fd 18 (0.0.0.0:9642<->) at ../lib/stream-ssl.c:972 (92% CPU usage)
      2022-09-20T14:10:50.160Z|05028|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 41 (/var/run/ovn/ovnsb_db.ctl<->) at ../lib/stream-fd.c:157 (92% CPU usage)
      2022-09-20T14:10:50.160Z|05029|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (192.x.x.x:9644<->) at ../lib/stream-ssl.c:972 (92% CPU usage)
      2022-09-20T14:10:50.160Z|05030|reconnect|WARN|ssl:192.x.x.x:48290: connection dropped (Protocol error)
      2022-09-20T14:10:50.161Z|05031|stream_ssl|WARN|SSL_accept: system error (Success)
      2022-09-20T14:10:50.161Z|05032|reconnect|WARN|ssl:192.x.x.x:34622: connection dropped (Protocol error)
      2022-09-20T14:10:50.162Z|05033|stream_ssl|WARN|SSL_accept: system error (Success)
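
When triaging logs like these, it can help to pull out the long poll intervals automatically. The sketch below assumes the pipe-separated layout visible in the snippets above (timestamp, sequence number, module, level, message); the function names and the 1000 ms threshold are hypothetical.

```python
import re

# Sample lines copied from the snippets in this report.
LOG_LINES = [
    "2022-09-19T14:11:00.477Z|336607|timeval|WARN|Unreasonably long 95070ms poll interval (75605ms user, 18997ms system)",
    "2022-09-19T14:14:34.097Z|102842|inc_proc_eng|INFO|node: northd, recompute (forced) took 12376ms",
    "2022-09-19T14:15:36.182Z|102844|timeval|WARN|Unreasonably long 74461ms poll interval (71012ms user, 508ms system)",
]

POLL = re.compile(r"Unreasonably long (\d+)ms poll interval")


def parse_line(line):
    """Split one log line into its five fields, or return None if malformed."""
    parts = line.strip().split("|", 4)
    if len(parts) != 5:
        return None  # e.g. a wrapped continuation line
    timestamp, seq, module, level, message = parts
    return {
        "timestamp": timestamp,
        "seq": int(seq),
        "module": module,
        "level": level,
        "message": message,
    }


def long_polls(lines, threshold_ms=1000):
    """Return (timestamp, interval_ms) for poll intervals at or above the threshold."""
    hits = []
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        match = POLL.search(record["message"])
        if match and int(match.group(1)) >= threshold_ms:
            hits.append((record["timestamp"], int(match.group(1))))
    return hits


for timestamp, interval_ms in long_polls(LOG_LINES):
    print(f"{timestamp}: main loop blocked for {interval_ms / 1000:.1f}s")
```

On the sample lines this reports the 95070 ms and 74461 ms intervals, both far longer than a healthy poll loop iteration.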

       

              jcaamano@redhat.com Jaime Caamaño Ruiz
              jhopper@redhat.com Jenifer Abrams
              Anurag Saxena Anurag Saxena