Red Hat OpenStack Services on OpenShift
OSPRH-25563

OVN Northd pods enter CrashLoopBackOff at scale (~32K VMs/networks) - livenessProbe timeoutSeconds=1 is insufficient for Northd recomputes.


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Component: ovn-operator
    • Priority Bugs, Neutron Gluon 2
    • 2
    • Critical

      We are running Nova workloads at large scale on a 250+ node RHOSO environment. The workload keeps a continuous churn on Nova for many hours. As the number of VMs grew under this churn, we observed that ovn-northd came under increasing load, consuming 100% CPU (with the default nThreads) already at the 12k+ VM scale.

      As the scale grows, the OVN Northd pods continuously crash due to Kubernetes liveness probe failures. The OVN Operator hardcodes livenessProbe.timeoutSeconds=1, but Northd recompute operations at this scale take ~1.2 to ~3.9 seconds, so the probe times out and Kubernetes terminates the pod with SIGTERM (signal 15).
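
      The mismatch can be checked from inside the northd container by timing the exact script the kubelet runs as the probe and by asking northd how long its main-loop phases take. A minimal check, assuming the pod/container names shown in the output below and the /tmp/ovn-northd.1.ctl control socket seen in the logs (the numeric part is the northd PID and may differ):

      # Time the exact script the kubelet runs for the liveness/readiness probes.
      oc exec ovn-northd-0 -c ovn-northd -- \
          bash -c 'time /usr/local/bin/container-scripts/status_check.sh'

      # Ask northd whether this instance is active or standby, and dump the
      # per-phase stopwatch timings of its main loop.
      oc exec ovn-northd-0 -c ovn-northd -- \
          ovn-appctl -t /tmp/ovn-northd.1.ctl status
      oc exec ovn-northd-0 -c ovn-northd -- \
          ovn-appctl -t /tmp/ovn-northd.1.ctl stopwatch/show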

      We tried increasing nThreads to 4, 16, 32 and 64 for the parallel flow recomputation, but none of it helped. At the ~32K VMs/networks scale the entire northd cluster is down.
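
      For reference, the thread count can be changed either persistently through the control plane CR or at runtime through ovn-appctl. The CR field path and the control plane name "openstack" below are assumptions (they reflect our understanding of the ovn-operator OVNNorthd spec and the openstackversion output further down):

      # Persistent: set nThreads on the OVNNorthd template; the operator
      # redeploys the northd pods with the new value.
      oc patch openstackcontrolplane openstack --type=merge -p \
          '{"spec":{"ovn":{"template":{"ovnNorthd":{"nThreads":16}}}}}'

      # Runtime (non-persistent): toggle parallel recomputation on one instance.
      oc exec ovn-northd-0 -c ovn-northd -- \
          ovn-appctl -t /tmp/ovn-northd.1.ctl parallel-build/set-n-threads 16
      oc exec ovn-northd-0 -c ovn-northd -- \
          ovn-appctl -t /tmp/ovn-northd.1.ctl parallel-build/get-n-threads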

      Workload characterization:
      Number of VMs targeted for creation: 50k (but could not continue due to northd crashes)
      Number of Iterations: 50k
      In each iteration, one tenant network is created and a VM is booted on it.
      Concurrency: 16 (number of parallel requests to Nova)
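
      One iteration of the workload is roughly the following. This is a minimal sketch, not our exact tooling; the image, flavor, naming and CIDR scheme are placeholders:

      #!/usr/bin/env bash
      # one_iteration.sh <i>: create a tenant network + subnet, then boot one VM on it.
      set -e
      i="$1"
      openstack network create "scale-net-${i}"
      openstack subnet create "scale-subnet-${i}" \
          --network "scale-net-${i}" --subnet-range "192.168.$(( i % 250 )).0/24"
      openstack server create "scale-vm-${i}" \
          --image cirros --flavor m1.tiny --network "scale-net-${i}" --wait

      # Driver: 50k iterations, 16 in parallel (matches the concurrency above).
      seq 1 50000 | xargs -P 16 -n 1 ./one_iteration.sh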

      Northd status:

      // At 15k+ VM scale (but this can occur much earlier)
      [root@e18-h18-000-r660 ~]# oc get pods | grep -i northd
      ovn-northd-0                                                      1/2     CrashLoopBackOff   31 (2m58s ago)   127m
      ovn-northd-1                                                      1/2     CrashLoopBackOff   33 (40s ago)     134m
      ovn-northd-2                                                      1/2     Running            1 (3s ago)       30s 
      
      // At ~32K VM scale (entire cluster is down)
      [root@e18-h18-000-r660 ~]# oc get pods | grep -i northd
      ovn-northd-0                                                      1/2     CrashLoopBackOff   273 (52s ago)     16h
      ovn-northd-1                                                      1/2     CrashLoopBackOff   271 (2m33s ago)   16h
      ovn-northd-2                                                      1/2     CrashLoopBackOff   273 (4m59s ago)   16h

       

      Glimpse of Northd logs:

      2026-01-20T19:29:26Z|07107|timeval|WARN|Unreasonably long 1596ms poll interval (1582ms user, 7ms system)
      2026-01-20T19:29:26Z|07108|timeval|WARN|faults: 15901 minor, 0 major
      2026-01-20T19:29:26Z|07109|timeval|WARN|context switches: 0 voluntary, 1 involuntary
      2026-01-20T19:29:27Z|07110|inc_proc_eng|INFO|Dropped 2 log messages in last 2 seconds (most recently, 1 seconds ago) due to excessive rate
      2026-01-20T19:29:27Z|07111|inc_proc_eng|INFO|node: northd, recompute (missing handler for input SB_dns) took 552ms
      2026-01-20T19:29:28Z|07112|timeval|WARN|Unreasonably long 1701ms poll interval (1673ms user, 21ms system)
      2026-01-20T19:29:28Z|07113|timeval|WARN|faults: 38337 minor, 0 major
      2026-01-20T19:29:28Z|07114|ovsdb_cs|INFO|ssl:ovsdbserver-sb-0.openstack.svc.cluster.local:6642: clustered database server is not cluster leader; trying another server
      2026-01-20T19:29:28Z|07115|reconnect|INFO|ssl:ovsdbserver-sb-0.openstack.svc.cluster.local:6642: connection attempt timed out
      2026-01-20T19:29:28Z|07116|reconnect|INFO|ssl:ovsdbserver-sb-1.openstack.svc.cluster.local:6642: connecting...
      2026-01-20T19:29:28Z|07117|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
      2026-01-20T19:29:28Z|07118|reconnect|INFO|ssl:ovsdbserver-sb-1.openstack.svc.cluster.local:6642: connected
      2026-01-20T19:29:29Z|07119|poll_loop|INFO|Dropped 37 log messages in last 6 seconds (most recently, 1 seconds ago) due to excessive rate
      2026-01-20T19:29:29Z|07120|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (10.128.1.167:55404<->172.30.32.138:6641) at lib/stream-ssl.c:842 (95% CPU usage)
      
      //
      
      2026-01-21T07:38:41Z|00066|poll_loop|INFO|Dropped 31 log messages in last 11 seconds (most recently, 11 seconds ago) due to excessive rate
      2026-01-21T07:38:41Z|00067|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 22 (/tmp/ovn-northd.1.ctl<->) at lib/stream-fd.c:157 (99% CPU usage)
      2026-01-21T07:38:41Z|00068|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (10.128.0.80:58236<->172.30.61.104:6641) at lib/stream-ssl.c:842 (99% CPU usage)
      2026-01-21T07:38:42Z|00069|inc_proc_eng|INFO|node: northd, recompute (missing handler for input SB_datapath_binding) took 1190ms
      2026-01-21T07:38:44Z|00001|fatal_signal(urcu1)|WARN|terminating with signal 15 (Terminated)
      2026-01-21T07:38:44Z|00001|fatal_signal(stopwatch0)|WARN|terminating with signal 15 (Terminated) 

      Northd Events:

      Events:
        Type     Reason          Age                     From               Message
        ----     ------          ----                    ----               -------
        Normal   Scheduled       129m                    default-scheduler  Successfully assigned openstack/ovn-northd-0 to e18-h20-000-r660
        Normal   AddedInterface  129m                    multus             Add eth0 [10.128.0.80/23] from ovn-kubernetes
        Normal   Pulled          129m                    kubelet            Container image "registry.redhat.io/rhoso/openstack-ovn-northd-rhel9@sha256:8042a62f2f6e45f1e81b920161cd6acf581da7e4c98d2652b9d40412e9bf7764" already present on machine
        Normal   Created         129m                    kubelet            Created container: ovn-northd
        Normal   Started         129m                    kubelet            Started container ovn-northd
        Normal   Pulled          129m                    kubelet            Container image "registry.redhat.io/rhoso-operators/openstack-network-exporter-rhel9@sha256:5bedf735d38d01647bd08744f408c21ca24ca663eaaa9e8d51d38406cfbae52d" already present on machine
        Normal   Created         129m                    kubelet            Created container: openstack-network-exporter
        Normal   Started         129m                    kubelet            Started container openstack-network-exporter
        Warning  Unhealthy       29m (x141 over 122m)    kubelet            Readiness probe failed: command timed out
        Warning  Unhealthy       19m (x161 over 122m)    kubelet            Liveness probe failed: command timed out
        Warning  BackOff         3m47s (x315 over 111m)  kubelet            Back-off restarting failed container ovn-northd in pod ovn-northd-0_openstack(e428775c-b9b9-4b78-b75d-fa09146afddc) 

      Statefulset Default liveness probe timeout:

      // liveness probe default timeout is 1 second
      [root@e18-h18-000-r660 ~]# oc get statefulset ovn-northd -o yaml | grep -A10 "liveness"
              livenessProbe:
                exec:
                  command:
                  - /usr/local/bin/container-scripts/status_check.sh
                failureThreshold: 3
                initialDelaySeconds: 10
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              name: ovn-northd
              readinessProbe: 
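
      As a temporary check (not a fix), the probe timeouts can be raised directly on the StatefulSet. Since the StatefulSet is managed by the ovn-operator, we expect such a change to be reconciled back; it only helps to confirm that a larger timeout stops the crash loop:

      # Raise the probe timeouts on the ovn-northd container (adjust the container
      # index if ovn-northd is not the first container in the pod spec).
      oc patch statefulset ovn-northd --type=json -p '[
        {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds",  "value": 10},
        {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 10}
      ]'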

      RHOSO version:

      [root@e18-h18-000-r660 ~]# oc get openstackversion
      NAME        TARGET VERSION            AVAILABLE VERSION         DEPLOYED VERSION
      openstack   18.0.15-20251126.192455   18.0.15-20251126.192455   18.0.15-20251126.192455 

      Actual Results:
      Northd recompute delays and liveness-probe kills make Northd a performance bottleneck for VM creation at scale. VM boot times increased from ~30 s to ~4 min.

      Expected results:
      Northd should be able to handle the load, and the Kubernetes liveness probe timeout should be configurable.
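
      Purely to illustrate the shape of the request, a probe-timeout knob on the OVNNorthd template could look like the following; the field name is hypothetical (no such field exists today):

      # HYPOTHETICAL field, shown only to illustrate the requested configurability.
      oc patch openstackcontrolplane openstack --type=merge -p \
          '{"spec":{"ovn":{"template":{"ovnNorthd":{"livenessProbeTimeoutSeconds":10}}}}}'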

              ykarel@redhat.com Yatin Karel
              rpulapak@redhat.com Rajesh Pulapakula
              rhos-dfg-networking-squad-neutron