Red Hat OpenStack Services on OpenShift
OSPRH-25563

OVN Northd pods enter CrashLoopBackOff at scale (~32K VMs/networks) - livenessProbe timeoutSeconds=1 is insufficient for Northd recomputes.


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Component: ovn-operator
    • Priority Bugs, Neutron Gluon 2
    • 2
    • Critical

      We are running Nova workloads at large scale on a 250+ node RHOSO environment. The workload keeps a continuous churn on Nova for many hours. As the number of VMs grew under this churn, we observed that ovn-northd came under increasing load, consuming 100% CPU (with the default nThreads) already at the 12k+ VM scale.

      As the scale grows, the OVN Northd pods continuously crash due to Kubernetes liveness probe failures. The OVN Operator hardcodes livenessProbe.timeoutSeconds=1, but Northd recompute operations at this scale take ~1.2 to ~3.9 seconds, so the probe times out and Kubernetes terminates the pod with SIGTERM (signal 15).
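
      The mismatch can be checked from inside the northd container by timing the exact script the kubelet runs as the probe and by asking northd how long its main-loop phases take. A minimal check, assuming the pod/container names shown in the output below and the /tmp/ovn-northd.1.ctl control socket seen in the logs (the numeric part is the northd PID and may differ):

      # Time the exact script the kubelet runs for the liveness/readiness probes.
      oc exec ovn-northd-0 -c ovn-northd -- \
          bash -c 'time /usr/local/bin/container-scripts/status_check.sh'

      # Ask northd whether this instance is active or standby, and dump the
      # per-phase stopwatch timings of its main loop.
      oc exec ovn-northd-0 -c ovn-northd -- \
          ovn-appctl -t /tmp/ovn-northd.1.ctl status
      oc exec ovn-northd-0 -c ovn-northd -- \
          ovn-appctl -t /tmp/ovn-northd.1.ctl stopwatch/show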

      We tried increasing nThreads to 4, 16, 32 and 64 for the parallel flow recomputation, but none of it helped. At the ~32K VMs/networks scale the entire northd cluster is down.
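
      For reference, the thread count can be changed either persistently through the control plane CR or at runtime through ovn-appctl. The CR field path and the control plane name "openstack" below are assumptions (they reflect our understanding of the ovn-operator OVNNorthd spec and the openstackversion output further down):

      # Persistent: set nThreads on the OVNNorthd template; the operator
      # redeploys the northd pods with the new value.
      oc patch openstackcontrolplane openstack --type=merge -p \
          '{"spec":{"ovn":{"template":{"ovnNorthd":{"nThreads":16}}}}}'

      # Runtime (non-persistent): toggle parallel recomputation on one instance.
      oc exec ovn-northd-0 -c ovn-northd -- \
          ovn-appctl -t /tmp/ovn-northd.1.ctl parallel-build/set-n-threads 16
      oc exec ovn-northd-0 -c ovn-northd -- \
          ovn-appctl -t /tmp/ovn-northd.1.ctl parallel-build/get-n-threads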

      Workload characterization:
      Number of VMs targeted for creation: 50k (but could not continue due to northd crashes)
      Number of Iterations: 50k
      In each iteration, one tenant network is created and a VM is booted on it.
      Concurrency: 16 (number of parallel requests to Nova)
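
      One iteration of the workload is roughly the following. This is a minimal sketch, not our exact tooling; the image, flavor, naming and CIDR scheme are placeholders:

      #!/usr/bin/env bash
      # one_iteration.sh <i>: create a tenant network + subnet, then boot one VM on it.
      set -e
      i="$1"
      openstack network create "scale-net-${i}"
      openstack subnet create "scale-subnet-${i}" \
          --network "scale-net-${i}" --subnet-range "192.168.$(( i % 250 )).0/24"
      openstack server create "scale-vm-${i}" \
          --image cirros --flavor m1.tiny --network "scale-net-${i}" --wait

      # Driver: 50k iterations, 16 in parallel (matches the concurrency above).
      seq 1 50000 | xargs -P 16 -n 1 ./one_iteration.sh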

      Northd status:

      // At 15k+ VM scale (but this can occur much earlier)
      [root@e18-h18-000-r660 ~]# oc get pods | grep -i northd
      ovn-northd-0                                                      1/2     CrashLoopBackOff   31 (2m58s ago)   127m
      ovn-northd-1                                                      1/2     CrashLoopBackOff   33 (40s ago)     134m
      ovn-northd-2                                                      1/2     Running            1 (3s ago)       30s 
      
      // At ~32K VM scale (entire cluster is down)
      [root@e18-h18-000-r660 ~]# oc get pods | grep -i northd
      ovn-northd-0                                                      1/2     CrashLoopBackOff   273 (52s ago)     16h
      ovn-northd-1                                                      1/2     CrashLoopBackOff   271 (2m33s ago)   16h
      ovn-northd-2                                                      1/2     CrashLoopBackOff   273 (4m59s ago)   16h

       

      Glimpse of Northd logs:

      2026-01-20T19:29:26Z|07107|timeval|WARN|Unreasonably long 1596ms poll interval (1582ms user, 7ms system)
      2026-01-20T19:29:26Z|07108|timeval|WARN|faults: 15901 minor, 0 major
      2026-01-20T19:29:26Z|07109|timeval|WARN|context switches: 0 voluntary, 1 involuntary
      2026-01-20T19:29:27Z|07110|inc_proc_eng|INFO|Dropped 2 log messages in last 2 seconds (most recently, 1 seconds ago) due to excessive rate
      2026-01-20T19:29:27Z|07111|inc_proc_eng|INFO|node: northd, recompute (missing handler for input SB_dns) took 552ms
      2026-01-20T19:29:28Z|07112|timeval|WARN|Unreasonably long 1701ms poll interval (1673ms user, 21ms system)
      2026-01-20T19:29:28Z|07113|timeval|WARN|faults: 38337 minor, 0 major
      2026-01-20T19:29:28Z|07114|ovsdb_cs|INFO|ssl:ovsdbserver-sb-0.openstack.svc.cluster.local:6642: clustered database server is not cluster leader; trying another server
      2026-01-20T19:29:28Z|07115|reconnect|INFO|ssl:ovsdbserver-sb-0.openstack.svc.cluster.local:6642: connection attempt timed out
      2026-01-20T19:29:28Z|07116|reconnect|INFO|ssl:ovsdbserver-sb-1.openstack.svc.cluster.local:6642: connecting...
      2026-01-20T19:29:28Z|07117|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
      2026-01-20T19:29:28Z|07118|reconnect|INFO|ssl:ovsdbserver-sb-1.openstack.svc.cluster.local:6642: connected
      2026-01-20T19:29:29Z|07119|poll_loop|INFO|Dropped 37 log messages in last 6 seconds (most recently, 1 seconds ago) due to excessive rate
      2026-01-20T19:29:29Z|07120|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (10.128.1.167:55404<->172.30.32.138:6641) at lib/stream-ssl.c:842 (95% CPU usage)
      
      //
      
      2026-01-21T07:38:41Z|00066|poll_loop|INFO|Dropped 31 log messages in last 11 seconds (most recently, 11 seconds ago) due to excessive rate
      2026-01-21T07:38:41Z|00067|poll_loop|INFO|wakeup due to [POLLIN][POLLHUP] on fd 22 (/tmp/ovn-northd.1.ctl<->) at lib/stream-fd.c:157 (99% CPU usage)
      2026-01-21T07:38:41Z|00068|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (10.128.0.80:58236<->172.30.61.104:6641) at lib/stream-ssl.c:842 (99% CPU usage)
      2026-01-21T07:38:42Z|00069|inc_proc_eng|INFO|node: northd, recompute (missing handler for input SB_datapath_binding) took 1190ms
      2026-01-21T07:38:44Z|00001|fatal_signal(urcu1)|WARN|terminating with signal 15 (Terminated)
      2026-01-21T07:38:44Z|00001|fatal_signal(stopwatch0)|WARN|terminating with signal 15 (Terminated) 

      Northd Events:

      Events:
        Type     Reason          Age                     From               Message
        ----     ------          ----                    ----               -------
        Normal   Scheduled       129m                    default-scheduler  Successfully assigned openstack/ovn-northd-0 to e18-h20-000-r660
        Normal   AddedInterface  129m                    multus             Add eth0 [10.128.0.80/23] from ovn-kubernetes
        Normal   Pulled          129m                    kubelet            Container image "registry.redhat.io/rhoso/openstack-ovn-northd-rhel9@sha256:8042a62f2f6e45f1e81b920161cd6acf581da7e4c98d2652b9d40412e9bf7764" already present on machine
        Normal   Created         129m                    kubelet            Created container: ovn-northd
        Normal   Started         129m                    kubelet            Started container ovn-northd
        Normal   Pulled          129m                    kubelet            Container image "registry.redhat.io/rhoso-operators/openstack-network-exporter-rhel9@sha256:5bedf735d38d01647bd08744f408c21ca24ca663eaaa9e8d51d38406cfbae52d" already present on machine
        Normal   Created         129m                    kubelet            Created container: openstack-network-exporter
        Normal   Started         129m                    kubelet            Started container openstack-network-exporter
        Warning  Unhealthy       29m (x141 over 122m)    kubelet            Readiness probe failed: command timed out
        Warning  Unhealthy       19m (x161 over 122m)    kubelet            Liveness probe failed: command timed out
        Warning  BackOff         3m47s (x315 over 111m)  kubelet            Back-off restarting failed container ovn-northd in pod ovn-northd-0_openstack(e428775c-b9b9-4b78-b75d-fa09146afddc) 

      Statefulset Default liveness probe timeout:

      // liveness probe default timeout is 1 second
      [root@e18-h18-000-r660 ~]# oc get statefulset ovn-northd -o yaml | grep -A10 "liveness"
              livenessProbe:
                exec:
                  command:
                  - /usr/local/bin/container-scripts/status_check.sh
                failureThreshold: 3
                initialDelaySeconds: 10
                periodSeconds: 5
                successThreshold: 1
                timeoutSeconds: 1
              name: ovn-northd
              readinessProbe: 
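
      As a temporary check (not a fix), the probe timeouts can be raised directly on the StatefulSet. Since the StatefulSet is managed by the ovn-operator, we expect such a change to be reconciled back; it only helps to confirm that a larger timeout stops the crash loop:

      # Raise the probe timeouts on the ovn-northd container (adjust the container
      # index if ovn-northd is not the first container in the pod spec).
      oc patch statefulset ovn-northd --type=json -p '[
        {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds",  "value": 10},
        {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 10}
      ]'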

      RHOSO version:

      [root@e18-h18-000-r660 ~]# oc get openstackversion
      NAME        TARGET VERSION            AVAILABLE VERSION         DEPLOYED VERSION
      openstack   18.0.15-20251126.192455   18.0.15-20251126.192455   18.0.15-20251126.192455 

      Actual Results:
      Northd recompute delays and liveness-probe kills make Northd a performance bottleneck for VM creation at scale. VM boot times increased from ~30 s to ~4 min.

      Expected results:
      Northd should be able to handle the load, and the Kubernetes liveness probe timeout should be configurable.
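
      Purely to illustrate the shape of the request, a probe-timeout knob on the OVNNorthd template could look like the following; the field name is hypothetical (no such field exists today):

      # HYPOTHETICAL field, shown only to illustrate the requested configurability.
      oc patch openstackcontrolplane openstack --type=merge -p \
          '{"spec":{"ovn":{"template":{"ovnNorthd":{"livenessProbeTimeoutSeconds":10}}}}}'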

              ykarel@redhat.com Yatin Karel
              rpulapak@redhat.com Rajesh Pulapakula
              rhos-dfg-networking-squad-neutron