-
Epic
-
Resolution: Unresolved
-
Major
-
None
-
None
-
OVN controller with hostname in ovn-remote stuck in connecting state with OVN SB db restarts
-
13
-
False
-
False
-
-
rhel-9
-
rhel-net-ovn
-
90% To Do, 0% In Progress, 10% Done
-
ssg_networking
-
Important
This epic tracks all the effort needed to deliver the solution related to the bug described below.
Problem Description:
Originally reported in RHOSO 18 (https://issues.redhat.com/browse/OSPRH-21332): after moving from OVN 24.03 to OVN 25.03, jobs started to fail during update because the OVN controller got stuck in the connecting state [1] after the OVN SB DBs were restarted.
Related slack thread in #ovn https://redhat-internal.slack.com/archives/C01G7T6SYSD/p1761662714544639
As pointed out in the thread, this is a regression introduced in OVN 24.09 by https://github.com/ovn-org/ovn/commit/762ae66cd70efa149d91d35305fcef0040e9addd. This was confirmed by reverting that commit in 25.03/25.09, after which reconnection works as in previous releases.
[1]
2025-10-27T11:37:08Z|00105|stream_ssl|ERR|ssl:ovsdbserver-sb-2.openstack.svc.cluster.local:6642: connect: Address family not supported by protocol
2025-10-27T11:37:16Z|00106|stream_ssl|ERR|ssl:ovsdbserver-sb-1.openstack.svc.cluster.local:6642: connect: Address family not supported by protocol
2025-10-27T11:37:24Z|00107|stream_ssl|ERR|ssl:ovsdbserver-sb-0.openstack.svc.cluster.local:6642: connect: Address family not supported by protocol
Impact Assessment: OVN controller stuck in connecting state
Software Versions:
ovn25.03-25.03.1-60.el9fdp
openvswitch3.5-3.5.2-51.el9fdp
Issue Type: It's a regression introduced in OVN 24.09
Reproducibility: Always with the given scenario (multiple SB DB hostnames in external_ids:ovn-remote); the issue has not been seen with 1 replica.
Reproduction Steps:
- Setup a 3 node OVN Raft cluster
- Configure OVN controller with external_ids:ovn-remote=<all three OVN SB DB servers by hostname; the issue does not reproduce with IPs>
- Restart all OVN SB DBs
- Check the OVN controller logs and observe that it gets stuck in the connecting state
Can also be reproduced using https://github.com/ovn-org/ovn-fake-multinode
This requires https://github.com/ovn-org/ovn-fake-multinode/pull/110
Set up the environment:
git clone https://github.com/ovn-org/ovn/ ~/ovn -b branch-25.03
git clone https://github.com/openvswitch/ovs/ ~/ovs -b branch-3.5
sudo OVN_SRC_PATH=${HOME}/ovn OVS_SRC_PATH=${HOME}/ovs ./ovn_cluster.sh build
sudo CENTRAL_IC_ID=ovn-central-az1-1 OVN_DB_CLUSTER=yes bash ./ovn_cluster.sh start
Configure OVN controller
sudo podman exec -it ovn-chassis-1 bash
# Ensure ovn-cluster-az1-1.example.com, ovn-cluster-az1-2.example.com and ovn-cluster-az1-3.example.com are configured on the DNS server with IPs 170.168.0.2, 170.168.0.3 and 170.168.0.4 respectively, then configure ovn-remote as below:
ovs-vsctl set open . external_ids:ovn-remote="ssl:ovn-cluster-az1-1.example.com:6642,ssl:ovn-cluster-az1-2.example.com:6642,ssl:ovn-cluster-az1-3.example.com:6642"
# Ensure it is connected to the SB DB using the hostname, via tail -f /var/log/ovn/ovn-controller.log or ovn-appctl connection-status
# Whether these hostnames work can also be confirmed with ovn-sbctl:
ovn-sbctl --db="ssl:ovn-cluster-az1-1.example.com:6642,ssl:ovn-cluster-az1-2.example.com:6642,ssl:ovn-cluster-az1-3.example.com:6642" --private-key=/opt/ovn/ovn-privkey.pem --certificate=/opt/ovn/ovn-cert.pem --ca-cert=/opt/ovn/pki/switchca/cacert.pem show
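The ovn-remote value used above is a comma-separated list of ssl:<host>:<port> entries. A minimal sketch for assembling it from a list of hostnames, to avoid copy/paste mistakes when repeating the reproduction; the function name build_ovn_remote is hypothetical and not part of OVN or OVS:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an OVN/OVS tool): join hostnames into an
# ovn-remote string of the form "ssl:<host>:<port>,ssl:<host>:<port>,...".
# Usage: build_ovn_remote <port> <host> [<host> ...]
build_ovn_remote() {
    local port=$1
    shift
    local remote="" host
    for host in "$@"; do
        # Prepend a comma only when the string already has an entry.
        remote+="${remote:+,}ssl:${host}:${port}"
    done
    printf '%s\n' "$remote"
}
```

Inside the chassis container this could be used as: ovs-vsctl set open . external_ids:ovn-remote="$(build_ovn_remote 6642 ovn-cluster-az1-1.example.com ovn-cluster-az1-2.example.com ovn-cluster-az1-3.example.com)"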
Kill SB DB on ovn central containers ovn-central-az1-1, ovn-central-az1-2 and ovn-central-az1-3
# ps -eaf|grep ovsdb-server-sb|grep -v grep
root 941 940 0 10:51 ? 00:00:00 ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-sb.log --pidfile=/var/run/ovn/ovnsb_db.pid --remote=punix:/var/run/ovn/ovnsb_db.sock --unixctl=/var/run/ovn/ovnsb_db.ctl --detach --monitor --remote=db:OVN_Southbound,SB_Global,connections --private-key=/opt/ovn/ovn-privkey.pem --certificate=/opt/ovn/ovn-cert.pem --ca-cert=/opt/ovn/pki/switchca/cacert.pem --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers /etc/ovn/ovnsb_db.db
# kill -9 941
Start the SB DB again in all three central containers
ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/ovn/ovsdb-server-sb.log --pidfile=/var/run/ovn/ovnsb_db.pid --remote=punix:/var/run/ovn/ovnsb_db.sock --unixctl=/var/run/ovn/ovnsb_db.ctl --detach --monitor --remote=db:OVN_Southbound,SB_Global,connections --private-key=/opt/ovn/ovn-privkey.pem --certificate=/opt/ovn/ovn-cert.pem --ca-cert=/opt/ovn/pki/switchca/cacert.pem --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers /etc/ovn/ovnsb_db.db
Check the ovn-controller logs in container ovn-chassis-1:
sudo podman exec -it ovn-chassis-1 bash
tail -f /var/log/ovn/ovn-controller.log
With the buggy version it will be stuck like this:
2025-10-30T10:53:36.035Z|00234|stream_ssl|ERR|ssl:ovn-cluster-az1-1.example.com:6642: connect: Address family not supported by protocol
2025-10-30T10:53:44.042Z|00235|stream_ssl|ERR|ssl:ovn-cluster-az1-2.example.com:6642: connect: Address family not supported by protocol
2025-10-30T10:53:52.046Z|00236|stream_ssl|ERR|ssl:ovn-cluster-az1-3.example.com:6642: connect: Address family not supported by protocol
2025-10-30T10:54:00.055Z|00237|stream_ssl|ERR|ssl:ovn-cluster-az1-1.example.com:6642: connect: Address family not supported by protocol
2025-10-30T10:54:08.063Z|00238|stream_ssl|ERR|ssl:ovn-cluster-az1-2.example.com:6642: connect: Address family not supported by protocol
2025-10-30T10:54:16.072Z|00239|stream_ssl|ERR|ssl:ovn-cluster-az1-3.example.com:6642: connect: Address family not supported by protocol
2025-10-30T10:54:24.079Z|00240|stream_ssl|ERR|ssl:ovn-cluster-az1-1.example.com:6642: connect: Address family not supported by protocol
2025-10-30T10:54:32.088Z|00241|stream_ssl|ERR|ssl:ovn-cluster-az1-2.example.com:6642: connect: Address family not supported by protocol
2025-10-30T10:54:40.096Z|00242|stream_ssl|ERR|ssl:ovn-cluster-az1-3.example.com:6642: connect: Address family not supported by protocol
2025-10-30T10:54:48.104Z|00243|stream_ssl|ERR|ssl:ovn-cluster-az1-1.example.com:6642: connect: Address family not supported by protocol
2025-10-30T10:54:56.106Z|00244|stream_ssl|ERR|ssl:ovn-cluster-az1-2.example.com:6642: connect: Address family not supported by protocol
2025-10-30T10:55:04.115Z|00245|stream_ssl|ERR|ssl:ovn-cluster-az1-3.example.com:6642: connect: Address family not supported by protocol
2025-10-30T10:55:12.123Z|00246|stream_ssl|ERR|ssl:ovn-cluster-az1-1.example.com:6642: connect: Address family not supported by protocol
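To check that every configured remote is failing the same way, the errors can be tallied per remote. A minimal sketch, assuming the pipe-separated log layout shown in the excerpt above (timestamp|sequence|module|level|message); the function name count_af_errors is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read ovn-controller log text on stdin and print
# "<count> <remote>" for each remote that logged the stream_ssl error
# "connect: Address family not supported by protocol".
count_af_errors() {
    awk -F'|' '
        $3 == "stream_ssl" && $4 == "ERR" &&
        $5 ~ /connect: Address family not supported by protocol/ {
            remote = $5
            sub(/: connect: .*/, "", remote)   # keep only "ssl:<host>:<port>"
            count[remote]++
        }
        END { for (r in count) print count[r], r }
    ' | sort -k2
}
```

For example, tail -n 200 /var/log/ovn/ovn-controller.log | count_af_errors. Fed the 13 lines of the excerpt above, it reports 5 errors for ovn-cluster-az1-1 and 4 each for ovn-cluster-az1-2 and ovn-cluster-az1-3, which matches the controller cycling through all three remotes in order.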
Expected Behavior: When the SB DB servers come back, the OVN controller should reconnect immediately, as in previous releases.
Observed Behavior: The OVN controller stays stuck in the connecting state until it is restarted or ovn-remote is updated.
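A rough way to tell the stuck state apart from a normal reconnect is to look at the most recent connection event in the log. A minimal sketch under two assumptions that are not stated in this issue: that a successful connection is logged by the reconnect module with a message containing ": connected", and that the log keeps the pipe-separated layout shown above. The function name last_event_is_af_error is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical check: read ovn-controller log text on stdin.
# Exit status 0 if the latest connection event is the
# "Address family not supported by protocol" error, 1 otherwise.
# ASSUMPTION: successful connections appear as "reconnect" module lines
# whose message contains ": connected".
last_event_is_af_error() {
    awk -F'|' '
        $3 == "reconnect"  && $5 ~ /: connected/ { stuck = 0 }
        $3 == "stream_ssl" && $5 ~ /Address family not supported by protocol/ { stuck = 1 }
        END { exit(stuck ? 0 : 1) }
    '
}
```

For example: if last_event_is_af_error < /var/log/ovn/ovn-controller.log; then echo "ovn-controller appears stuck"; fi. This only inspects the log; it does not restart ovn-controller or change ovn-remote.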
Troubleshooting Actions: More details in https://issues.redhat.com/browse/OSPRH-21332 and the Slack thread https://redhat-internal.slack.com/archives/C01G7T6SYSD/p1761662714544639
- is depended on by
-
OSPRH-21332 OVN Minor update broken FR4
-
- Refinement
-