- Bug
- Resolution: Cannot Reproduce
- Major
- None
- 4.13
- None
- No
- SDN Sprint 246
- 1
- Rejected
- False
Description of problem:
ovn-kubernetes: NB DB Raft leader is unknown to the cluster node.
A few application pods fail with:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_mariadb-pipelines-definition-745fc46d64-wprjf_ds-1_bc4c8728-0d4c-45f9-b709-4eab802fa0ab_0(5b88e8ba48540ededa4a12b7eddba528b8721201578a371d53c120c07e4299f1): error adding pod ds-1_mariadb-pipelines-definition-745fc46d64-wprjf to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [ds-1/mariadb-pipelines-definition-745fc46d64-wprjf/bc4c8728-0d4c-45f9-b709-4eab802fa0ab:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[ds-1/mariadb-pipelines-definition-745fc46d64-wprjf 5b88e8ba48540ededa4a12b7eddba528b8721201578a371d53c120c07e4299f1] [ds-1/mariadb-pipelines-definition-745fc46d64-wprjf 5b88e8ba48540ededa4a12b7eddba528b8721201578a371d53c120c07e4299f1] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:81:02:e8 [10.129.2.232/23] '
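The "timed out waiting for OVS port binding (ovn-installed)" part means the CNI gave up waiting for ovn-controller to mark the pod's OVS interface as bound (the ovn-installed flag in the interface's external_ids). A minimal diagnostic sketch for the affected worker is below; the node name is not given in this report, so <node> is a placeholder, and the iface-id value is taken from the error above:
$ # hedged sketch: <node> is the worker hosting the failing pod
$ oc debug node/<node> -- chroot /host \
    ovs-vsctl --columns=name,external_ids find Interface \
    external_ids:iface-id=ds-1_mariadb-pipelines-definition-745fc46d64-wprjf
If external_ids never gains ovn-installed=true, the problem is on the ovn-controller / southbound DB side rather than in the CNI itself, which is consistent with the Raft leader issue described next.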
Looking in the ovnkube-master-qk4rd pod (namespace openshift-ovn-kubernetes) I see:
Readiness probe failed: NB DB Raft leader is unknown to the cluster node.
+ [[ ! ssl:192.169.1.138:9641,ssl:192.169.2.219:9641,ssl:192.169.1.91:9641 =~ .*:192\.169\.1\.91:.* ]]
++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound
++ grep 'Leader: unknown'
+ leader_status='Leader: unknown'
+ [[ ! -z Leader: unknown ]]
+ echo 'NB DB Raft leader is unknown to the cluster node.'
+ exit 1
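For reference, the same cluster/status check the readiness probe runs can be issued manually against each master's nbdb container, to see which member (if any) believes it is the Raft leader. The pod name below is the one from this report; repeat for the other two masters:
$ oc exec -n openshift-ovn-kubernetes ovnkube-master-qk4rd -c nbdb -- \
    ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound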
Version-Release number of selected component (if applicable):
OCP 4.13.5
How reproducible:
Unknown
Steps to Reproduce:
1. Install OCP 4.13.5 with OVN on Openstack (PSI)
2. Install RHODS 1.34 (might not be the root cause, but it was done on the env)
3. Install OCP Pipeline operator
Actual results:
Many applications failed with `error adding container to network "ovn-kubernetes": CNI request failed with status 400`, which prevents any operation with those applications, for example the Pipeline server:
Expected results:
Additional info:
Restarting OCP master nodes (Openstack instances) seemed to resolve it, but in openshift-ovn-kubernetes there's still a failed pod showing:
$ oc logs -n openshift-ovn-kubernetes ovnkube-master-bwkhl
Defaulted container "northd" out of: northd, nbdb, kube-rbac-proxy, sbdb, ovnkube-master, ovn-dbchecker
+ [[ -f /env/_master ]]
+ trap quit TERM INT
++ date -Iseconds
+ echo '2023-10-22T15:20:42+00:00 - starting ovn-northd'
2023-10-22T15:20:42+00:00 - starting ovn-northd
+ wait 7
+ exec ovn-northd --no-chdir -vconsole:info -vfile:off '-vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m' --ovnnb-db ssl:192.169.1.138:9641,ssl:192.169.2.219:9641,ssl:192.169.1.91:9641 --ovnsb-db ssl:192.169.1.138:9642,ssl:192.169.2.219:9642,ssl:192.169.1.91:9642 --pidfile /var/run/ovn/ovn-northd.pid --n-threads=4 -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt
2023-10-22T15:20:42.793Z|00001|ovn_northd|INFO|Using 4 threads
2023-10-22T15:20:42.793Z|00002|ovn_northd|INFO|OVN internal version is : [23.03.1-20.27.0-70.6]
2023-10-22T15:20:42.794Z|00003|ovn_parallel_hmap|INFO|Setting thread count to 4
2023-10-22T15:20:42.794Z|00004|ovn_parallel_hmap|INFO|Creating new pool with size 4
2023-10-22T15:20:42.800Z|00005|reconnect|INFO|ssl:192.169.1.138:9641: connecting...
2023-10-22T15:20:42.800Z|00006|ovn_northd|INFO|OVN NB IDL reconnected, force recompute.
2023-10-22T15:20:42.801Z|00007|reconnect|INFO|ssl:192.169.2.219:9642: connecting...
2023-10-22T15:20:42.801Z|00008|reconnect|INFO|ssl:192.169.2.219:9642: connection attempt failed (Connection refused)
2023-10-22T15:20:42.801Z|00009|reconnect|INFO|ssl:192.169.1.138:9642: connecting...
2023-10-22T15:20:42.801Z|00010|ovn_northd|INFO|OVN SB IDL reconnected, force recompute.
2023-10-22T15:20:42.801Z|00011|reconnect|INFO|ssl:192.169.1.138:9641: connection attempt failed (Connection refused)
2023-10-22T15:20:42.801Z|00012|reconnect|INFO|ssl:192.169.2.219:9641: connecting...
2023-10-22T15:20:42.801Z|00013|reconnect|INFO|ssl:192.169.2.219:9641: connection attempt failed (Connection refused)
2023-10-22T15:20:42.801Z|00014|reconnect|INFO|ssl:192.169.1.91:9641: connecting...
2023-10-22T15:20:42.801Z|00015|reconnect|INFO|ssl:192.169.1.138:9642: connection attempt failed (Connection refused)
2023-10-22T15:20:42.801Z|00016|reconnect|INFO|ssl:192.169.1.91:9642: connecting...
2023-10-22T15:20:42.802Z|00017|reconnect|INFO|ssl:192.169.1.91:9642: connection attempt failed (Connection refused)
2023-10-22T15:20:42.807Z|00018|reconnect|INFO|ssl:192.169.1.91:9641: connected
2023-10-22T15:20:42.810Z|00019|ovsdb_cs|INFO|ssl:192.169.1.91:9641: clustered database server is disconnected from cluster; trying another server
2023-10-22T15:20:42.810Z|00020|reconnect|INFO|ssl:192.169.1.91:9641: connection attempt timed out
2023-10-22T15:20:43.804Z|00021|reconnect|INFO|ssl:192.169.2.219:9642: connecting...
2023-10-22T15:20:43.804Z|00022|reconnect|INFO|ssl:192.169.2.219:9642: connection attempt failed (Connection refused)
2023-10-22T15:20:43.804Z|00023|reconnect|INFO|ssl:192.169.2.219:9642: waiting 2 seconds before reconnect
2023-10-22T15:20:43.810Z|00024|reconnect|INFO|ssl:192.169.1.138:9641: connecting...
2023-10-22T15:20:43.810Z|00025|reconnect|INFO|ssl:192.169.1.138:9641: connection attempt failed (Connection refused)
2023-10-22T15:20:43.810Z|00026|reconnect|INFO|ssl:192.169.1.138:9641: waiting 2 seconds before reconnect
2023-10-22T15:20:45.805Z|00027|reconnect|INFO|ssl:192.169.1.138:9642: connecting...
2023-10-22T15:20:45.805Z|00028|reconnect|INFO|ssl:192.169.1.138:9642: connection attempt failed (Connection refused)
2023-10-22T15:20:45.805Z|00029|reconnect|INFO|ssl:192.169.1.138:9642: waiting 4 seconds before reconnect
2023-10-22T15:20:45.810Z|00030|reconnect|INFO|ssl:192.169.2.219:9641: connecting...
2023-10-22T15:20:45.815Z|00031|reconnect|INFO|ssl:192.169.2.219:9641: connected
2023-10-22T15:20:45.816Z|00032|ovsdb_cs|INFO|ssl:192.169.2.219:9641: clustered database server is disconnected from cluster; trying another server
2023-10-22T15:20:45.816Z|00033|reconnect|INFO|ssl:192.169.2.219:9641: connection attempt timed out
2023-10-22T15:20:45.816Z|00034|reconnect|INFO|ssl:192.169.2.219:9641: waiting 4 seconds before reconnect
2023-10-22T15:20:49.809Z|00035|reconnect|INFO|ssl:192.169.1.91:9642: connecting...
2023-10-22T15:20:49.809Z|00036|reconnect|INFO|ssl:192.169.1.91:9642: connection attempt failed (Connection refused)
2023-10-22T15:20:49.809Z|00037|reconnect|INFO|ssl:192.169.1.91:9642: continuing to reconnect in the background but suppressing further logging
2023-10-22T15:20:49.817Z|00038|reconnect|INFO|ssl:192.169.1.91:9641: connecting...
2023-10-22T15:20:49.822Z|00039|reconnect|INFO|ssl:192.169.1.91:9641: connected
2023-10-22T15:20:49.824Z|00040|ovsdb_cs|INFO|ssl:192.169.1.91:9641: clustered database server is disconnected from cluster; trying another server
2023-10-22T15:20:49.824Z|00041|reconnect|INFO|ssl:192.169.1.91:9641: connection attempt timed out
2023-10-22T15:20:49.824Z|00042|reconnect|INFO|ssl:192.169.1.91:9641: continuing to reconnect in the background but suppressing further logging
2023-10-22T15:20:57.818Z|00043|memory|INFO|12336 kB peak resident set size after 15.0 seconds
2023-10-22T15:21:05.830Z|00044|reconnect|INFO|ssl:192.169.2.219:9641: connected
2023-10-22T15:21:05.832Z|00045|ovsdb_cs|INFO|ssl:192.169.2.219:9641: clustered database server is disconnected from cluster; trying another server
2023-10-22T15:21:13.839Z|00046|reconnect|INFO|ssl:192.169.1.91:9641: connected
2023-10-22T15:21:13.842Z|00047|ovsdb_cs|INFO|ssl:192.169.1.91:9641: clustered database server is disconnected from cluster; trying another server
2023-10-22T15:21:21.848Z|00048|reconnect|INFO|ssl:192.169.1.138:9641: connected
2023-10-22T15:21:21.850Z|00049|ovsdb_cs|INFO|ssl:192.169.1.138:9641: clustered database server is disconnected from cluster; trying another server
2023-10-22T15:21:29.856Z|00050|reconnect|INFO|ssl:192.169.2.219:9641: connected
2023-10-22T15:21:29.859Z|00051|ovsdb_cs|INFO|ssl:192.169.2.219:9641: clustered database server is disconnected from cluster; trying another server
2023-10-22T15:21:37.865Z|00052|reconnect|INFO|ssl:192.169.1.91:9642: connected
2023-10-22T15:21:37.867Z|00053|reconnect|INFO|ssl:192.169.1.91:9641: connected
2023-10-22T15:21:37.867Z|00054|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2023-10-22T15:21:37.867Z|00055|ovsdb_cs|INFO|ssl:192.169.1.91:9642: clustered database server is disconnected from cluster; trying another server
2023-10-22T15:21:37.867Z|00056|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2023-10-22T15:21:37.870Z|00057|ovsdb_cs|INFO|ssl:192.169.1.91:9641: clustered database server is not cluster leader; trying another server
2023-10-22T15:21:45.878Z|00058|reconnect|INFO|ssl:192.169.2.219:9642: connected
2023-10-22T15:21:45.879Z|00059|reconnect|INFO|ssl:192.169.1.138:9641: connected
2023-10-22T15:21:45.880Z|00060|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2023-10-22T15:21:45.880Z|00061|ovsdb_cs|INFO|ssl:192.169.2.219:9642: clustered database server is disconnected from cluster; trying another server
2023-10-22T15:21:45.880Z|00062|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2023-10-22T15:21:53.893Z|00063|reconnect|INFO|ssl:192.169.1.138:9642: connected
2023-10-22T15:21:53.895Z|00064|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2023-10-22T15:21:53.896Z|00065|ovsdb_cs|INFO|ssl:192.169.1.138:9642: clustered database server is disconnected from cluster; trying another server
2023-10-22T15:21:53.896Z|00066|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2023-10-22T15:22:01.906Z|00067|reconnect|INFO|ssl:192.169.1.91:9642: connected
2023-10-22T15:22:01.908Z|00068|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2023-10-22T15:22:01.908Z|00069|ovsdb_cs|INFO|ssl:192.169.1.91:9642: clustered database server is disconnected from cluster; trying another server
2023-10-22T15:22:01.908Z|00070|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2023-10-22T15:22:09.914Z|00071|reconnect|INFO|ssl:192.169.2.219:9642: connected
2023-10-22T15:22:09.915Z|00072|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
2023-10-22T15:22:09.916Z|00073|ovsdb_cs|INFO|ssl:192.169.2.219:9642: clustered database server is disconnected from cluster; trying another server
2023-10-22T15:22:09.916Z|00074|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
2023-10-22T15:22:17.922Z|00075|reconnect|INFO|ssl:192.169.1.138:9642: connected
2023-10-22T15:22:26.696Z|00076|memory|INFO|peak resident set size grew 120% in last 88.9 seconds, from 12336 kB to 27112 kB
2023-10-22T15:22:26.703Z|00077|memory|INFO|idl-cells-OVN_Northbound:7559 idl-cells-OVN_Southbound:32114
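Since this northd container keeps cycling between "lock acquired" and "lock lost" without ever finding a Raft leader, a hedged clean-up sketch (assuming the other two masters report healthy, and using the pod name from this report) is to re-check the Raft membership from a healthy member and then let the failed pod be recreated:
$ # re-check membership as seen by a healthy member (sbdb uses ovnsb_db.ctl / OVN_Southbound)
$ oc exec -n openshift-ovn-kubernetes ovnkube-master-qk4rd -c nbdb -- \
    ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
$ # recreate the stale pod; its controller brings it back with fresh containers
$ oc delete pod -n openshift-ovn-kubernetes ovnkube-master-bwkhl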