-
Bug
-
Resolution: Unresolved
-
Critical
-
None
-
None
-
None
-
2
-
False
-
-
False
-
?
-
rhos-connectivity-neutron
-
None
-
-
-
-
Priority Bugs, Neutron Gluon 2
-
2
-
Critical
We are currently running Nova workloads at large scale on a 250+ node RHOSO environment. The workload keeps a continuous churn on Nova for many hours. As the number of VMs grew under this continuous churn, we observed northd coming under increasingly heavy load, consuming 100% CPU (with the default nThreads) at only ~12k VMs.
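For reference, this is roughly how we sample CPU usage on the northd pods (a sketch; it assumes the cluster metrics stack is available, and uses the openstack namespace and pod names shown in the outputs further down):

# Per-pod / per-container CPU and memory usage (requires metrics to be available)
oc -n openstack adm top pods --containers | grep -i northd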
As this scale grows, the OVN Northd pods continuously crash due to Kubernetes liveness probe failures. The OVN Operator hardcodes livenessProbe.timeoutSeconds=1, but Northd recompute operations at this scale take ~1.2 to ~3.9 seconds, so the probe times out and Kubernetes terminates the pod with SIGTERM (signal 15).
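One way to test the probe-timeout theory is to relax the timeout directly on the rendered StatefulSet. This is only a sketch: the container index is an assumption, and since the ovn-operator owns the StatefulSet it is expected to reconcile the value back, which is exactly why we are asking for the timeout to be configurable.

# Verify which container index is ovn-northd first (assumed to be 0 below):
#   oc -n openstack get statefulset ovn-northd -o jsonpath='{.spec.template.spec.containers[*].name}'
oc -n openstack patch statefulset ovn-northd --type=json -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds",
   "value": 5}
]'
# The operator is likely to revert this on its next reconcile; it is only
# useful to confirm that the probe timeout is what kills the container.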
We tried increasing nThreads to 4, 16, 32 and 64 for parallel flow recomputation, but none of it helped. At the ~32k VM/network scale, the entire northd cluster is down.
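For completeness, the thread count can also be inspected and changed at runtime through northd's upstream parallel-build appctl interface, roughly as below (a sketch; if the default target lookup fails, -t may need the explicit control socket path, e.g. the /tmp/ovn-northd.1.ctl seen in the logs further down):

# Check / change the northd parallel lflow-build thread count at runtime
oc -n openstack exec ovn-northd-0 -c ovn-northd -- \
  ovn-appctl -t ovn-northd parallel-build/get-n-threads
oc -n openstack exec ovn-northd-0 -c ovn-northd -- \
  ovn-appctl -t ovn-northd parallel-build/set-n-threads 16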
Workload characterization:
Number of VMs targeted for creation: 50k (but could not continue due to northd crashes)
Number of Iterations: 50k
In each iteration, 1 tenant network gets created and a VM is booted on it (see the sketch after this list).
Concurrency: 16 (number of parallel requests to Nova)
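A minimal, illustrative sketch of the driver loop (not our actual tooling; image, flavor and subnet ranges are placeholders):

#!/bin/bash
# Illustrative reproduction loop: one tenant network + one VM per iteration,
# at most 16 iterations in flight. IMAGE/FLAVOR are placeholders.
IMAGE=cirros
FLAVOR=m1.tiny

run_one() {
  local i=$1
  openstack network create "scale-net-${i}"
  openstack subnet create "scale-subnet-${i}" --network "scale-net-${i}" \
    --subnet-range "192.168.$((i % 250)).0/24"
  openstack server create "scale-vm-${i}" --image "$IMAGE" --flavor "$FLAVOR" \
    --network "scale-net-${i}" --wait
}

for i in $(seq 1 50000); do
  run_one "$i" &
  # throttle to 16 concurrent iterations (requires bash >= 4.3 for wait -n)
  while [ "$(jobs -rp | wc -l)" -ge 16 ]; do wait -n; done
done
wait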
Northd status:
// During 15k+ VM scale (but this can occur much before)
[root@e18-h18-000-r660 ~]# oc get pods | grep -i northd
ovn-northd-0   1/2   CrashLoopBackOff   31 (2m58s ago)    127m
ovn-northd-1   1/2   CrashLoopBackOff   33 (40s ago)      134m
ovn-northd-2   1/2   Running            1 (3s ago)        30s

// At ~32K VM scale (entire cluster is down)
[root@e18-h18-000-r660 ~]# oc get pods | grep -i northd
ovn-northd-0   1/2   CrashLoopBackOff   273 (52s ago)     16h
ovn-northd-1   1/2   CrashLoopBackOff   271 (2m33s ago)   16h
ovn-northd-2   1/2   CrashLoopBackOff   273 (4m59s ago)   16h
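To confirm the restarts are probe-driven rather than northd aborting on its own, the last terminated state and the previous container logs can be checked along these lines:

# Events plus last container state for one of the crashing pods
oc -n openstack describe pod ovn-northd-0
oc -n openstack get pod ovn-northd-0 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="ovn-northd")].lastState.terminated}'
# Logs of the previous (killed) container instance
oc -n openstack logs ovn-northd-0 -c ovn-northd --previous | tail -n 50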
Glimpse of Northd logs:
2026-01-20T19:29:26Z|07107|timeval|WARN|Unreasonably long 1596ms poll interval (1582ms user, 7ms system)
2026-01-20T19:29:26Z|07108|timeval|WARN|faults: 15901 minor, 0 major
2026-01-20T19:29:26Z|07109|timeval|WARN|context switches: 0 voluntary, 1 involuntary
2026-01-20T19:29:27Z|07110|inc_proc_eng|INFO|Dropped 2 log messages in last 2 seconds (most recently, 1 seconds ago) due to excessive rate
2026-01-20T19:29:27Z|07111|inc_proc_eng|INFO|node: northd, recompute (missing handler for input SB_dns) took 552ms
2026-01-20T19:29:28Z|07112|timeval|WARN|Unreasonably long 1701ms poll interval (1673ms user, 21ms system)
2026-01-20T19:29:28Z|07113|timeval|WARN|faults: 38337 minor, 0 major
2026-01-20T19:29:28Z|07114|ovsdb_cs|INFO|ssl:ovsdbserver-sb-0.openstack.svc.cluster.local:6642: clustered database server is not cluster leader; trying another server
2026-01-20T19:29:28Z|07115|reconnect|INFO|ssl:ovsdbserver-sb-0.openstack.svc.cluster.local:6642: connection attempt timed out
2026-01-20T19:29:28Z|07116|reconnect|INFO|ssl:ovsdbserver-sb-1.openstack.svc.cluster.local:6642: connecting...
2026-01-20T19:29:28Z|07117|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2026-01-20T19:29:28Z|07118|reconnect|INFO|ssl:ovsdbserver-sb-1.openstack.svc.cluster.local:6642: connected
2026-01-20T19:29:29Z|07119|poll_loop|INFO|Dropped 37 log messages in last 6 seconds (most recently, 1 seconds ago) due to excessive rate
2026-01-20T19:29:29Z|07120|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (10.128.1.167:55404<->172.30.32.138:6641) at lib/stream-ssl.c:842 (95% CPU usage)

//

2026-01-21T07:38:41Z|00066|poll_loop|INFO|Dropped 31 log messages in last 11 seconds (most recently, 11 seconds ago) due to excessive rate
2026-01-21T07:38:41Z|00067|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 22 (/tmp/ovn-northd.1.ctl<->) at lib/stream-fd.c:157 (99% CPU usage)
2026-01-21T07:38:41Z|00068|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (10.128.0.80:58236<->172.30.61.104:6641) at lib/stream-ssl.c:842 (99% CPU usage)
2026-01-21T07:38:42Z|00069|inc_proc_eng|INFO|node: northd, recompute (missing handler for input SB_datapath_binding) took 1190ms
2026-01-21T07:38:44Z|00001|fatal_signal(urcu1)|WARN|terminating with signal 15 (Terminated)
2026-01-21T07:38:44Z|00001|fatal_signal(stopwatch0)|WARN|terminating with signal 15 (Terminated)
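The recompute timings and long-poll warnings quoted above can be pulled out with a simple log filter; northd's incremental-engine statistics can also be queried over appctl (a sketch, with the same control-socket caveat as above):

# Extract recompute durations and long poll intervals from the northd logs
oc -n openstack logs ovn-northd-0 -c ovn-northd | \
  grep -E 'recompute .* took|Unreasonably long'
# Incremental-engine recompute/compute counters (socket path may need -t /tmp/ovn-northd.1.ctl)
oc -n openstack exec ovn-northd-0 -c ovn-northd -- \
  ovn-appctl -t ovn-northd inc-engine/show-stats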
Northd Events:
Events:
  Type     Reason          Age                     From               Message
  ----     ------          ----                    ----               -------
  Normal   Scheduled       129m                    default-scheduler  Successfully assigned openstack/ovn-northd-0 to e18-h20-000-r660
  Normal   AddedInterface  129m                    multus             Add eth0 [10.128.0.80/23] from ovn-kubernetes
  Normal   Pulled          129m                    kubelet            Container image "registry.redhat.io/rhoso/openstack-ovn-northd-rhel9@sha256:8042a62f2f6e45f1e81b920161cd6acf581da7e4c98d2652b9d40412e9bf7764" already present on machine
  Normal   Created         129m                    kubelet            Created container: ovn-northd
  Normal   Started         129m                    kubelet            Started container ovn-northd
  Normal   Pulled          129m                    kubelet            Container image "registry.redhat.io/rhoso-operators/openstack-network-exporter-rhel9@sha256:5bedf735d38d01647bd08744f408c21ca24ca663eaaa9e8d51d38406cfbae52d" already present on machine
  Normal   Created         129m                    kubelet            Created container: openstack-network-exporter
  Normal   Started         129m                    kubelet            Started container openstack-network-exporter
  Warning  Unhealthy       29m (x141 over 122m)    kubelet            Readiness probe failed: command timed out
  Warning  Unhealthy       19m (x161 over 122m)    kubelet            Liveness probe failed: command timed out
  Warning  BackOff         3m47s (x315 over 111m)  kubelet            Back-off restarting failed container ovn-northd in pod ovn-northd-0_openstack(e428775c-b9b9-4b78-b75d-fa09146afddc)
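The same probe failures can also be pulled straight from the events API, e.g.:

# Only the Unhealthy events for one northd pod
oc -n openstack get events \
  --field-selector involvedObject.name=ovn-northd-0,reason=Unhealthy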
Statefulset Default liveness probe timeout:
// liveness probe default timeout is 1 second
[root@e18-h18-000-r660 ~]# oc get statefulset ovn-northd -o yaml | grep -A10 "liveness"
        livenessProbe:
          exec:
            command:
            - /usr/local/bin/container-scripts/status_check.sh
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1
        name: ovn-northd
        readinessProbe:
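Timing the probe script itself while northd is busy shows how easily the 1 second budget above is exceeded (script path taken from the probe definition; this is a sketch, not our actual measurement method):

# Run the same command the kubelet runs for the probe and time it
oc -n openstack exec ovn-northd-0 -c ovn-northd -- \
  bash -c 'time /usr/local/bin/container-scripts/status_check.sh'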
RHOSO version:
[root@e18-h18-000-r660 ~]# oc get openstackversion
NAME        TARGET VERSION            AVAILABLE VERSION         DEPLOYED VERSION
openstack   18.0.15-20251126.192455   18.0.15-20251126.192455   18.0.15-20251126.192455
Actual results:
Northd recompute delays and liveness probe kills make Northd a performance bottleneck for VM creation at scale. VM boot times increased from ~30s to ~4min.
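One way to sample the end-to-end boot time quoted above (illustrative only; the network, image and flavor names are placeholders):

# Illustrative boot-time sample; "private", IMAGE and FLAVOR are placeholders
time openstack server create boot-time-probe \
  --image "$IMAGE" --flavor "$FLAVOR" --network private --wait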
Expected results:
Northd should be able to handle this load, and the Kubernetes liveness probe timeout should be configurable.
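Purely as an illustration of the ask (the field below does not exist today; its name, the CR name and the exact spec path are hypothetical), something along these lines on the OVNNorthd template would be enough:

# HYPOTHETICAL: livenessProbeTimeoutSeconds is not an existing field; this only
# illustrates the kind of knob being requested. CR name and path may differ.
oc -n openstack patch openstackcontrolplane openstack --type=merge \
  -p '{"spec":{"ovn":{"template":{"ovnNorthd":{"livenessProbeTimeoutSeconds":5}}}}}'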