OCPBUGS-22251

ovn-kubernetes: NB DB Raft leader is unknown to the cluster node

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Affects Version: 4.13
    • Sprint: SDN Sprint 246

      Description of problem:

      ovn-kubernetes: NB DB Raft leader is unknown to the cluster node.

      A few application pods fail with:

      Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_mariadb-pipelines-definition-745fc46d64-wprjf_ds-1_bc4c8728-0d4c-45f9-b709-4eab802fa0ab_0(5b88e8ba48540ededa4a12b7eddba528b8721201578a371d53c120c07e4299f1): error adding pod ds-1_mariadb-pipelines-definition-745fc46d64-wprjf to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [ds-1/mariadb-pipelines-definition-745fc46d64-wprjf/bc4c8728-0d4c-45f9-b709-4eab802fa0ab:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[ds-1/mariadb-pipelines-definition-745fc46d64-wprjf 5b88e8ba48540ededa4a12b7eddba528b8721201578a371d53c120c07e4299f1] [ds-1/mariadb-pipelines-definition-745fc46d64-wprjf 5b88e8ba48540ededa4a12b7eddba528b8721201578a371d53c120c07e4299f1] failed to configure pod interface: timed out waiting for OVS port binding (ovn-installed) for 0a:58:0a:81:02:e8 [10.129.2.232/23] '
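The CNI failure names the affected pod as `<namespace>_<pod>`, so the namespace and pod can be read straight out of the error message; a minimal parsing sketch over an abbreviated copy of the error above:

```shell
# The CNI failure embeds the affected pod as "<namespace>_<pod>"; this
# pulls both out of an abbreviated copy of the error message above.
err='error adding pod ds-1_mariadb-pipelines-definition-745fc46d64-wprjf to CNI network "multus-cni-network"'

pod_ref=$(echo "$err" | sed -n 's/.*error adding pod \([^ ]*\) to CNI network.*/\1/p')
ns=${pod_ref%%_*}   # namespace: text before the first underscore
pod=${pod_ref#*_}   # pod name: text after it
echo "$ns/$pod"     # ds-1/mariadb-pipelines-definition-745fc46d64-wprjf
```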

      Looking at pod ovnkube-master-qk4rd in namespace openshift-ovn-kubernetes, I see:

      Readiness probe failed: NB DB Raft leader is unknown to the cluster node.
      + [[ ! ssl:192.169.1.138:9641,ssl:192.169.2.219:9641,ssl:192.169.1.91:9641 =~ .*:192\.169\.1\.91:.* ]]
      ++ /usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=3 cluster/status OVN_Northbound
      ++ grep 'Leader: unknown'
      + leader_status='Leader: unknown'
      + [[ ! -z Leader: unknown ]]
      + echo 'NB DB Raft leader is unknown to the cluster node.'
      + exit 1
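The probe's leader check is just a grep over `ovn-appctl ... cluster/status` output; a hedged sketch of the same logic over sample output (to run the real check in-cluster you would exec into the nbdb container, e.g. `oc exec -n openshift-ovn-kubernetes <ovnkube-master-pod> -c nbdb -- ovn-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound`; the sample fields below are abbreviated):

```shell
# Sample cluster/status output as the failing probe would see it
# (abbreviated; real output also lists servers, term, index, etc.).
status_output='Name: OVN_Northbound
Cluster ID: not yet known
Leader: unknown
Status: cluster member'

# Same check the readiness probe performs: any "Leader: unknown" line
# means this member does not know the Raft leader, so the probe fails.
leader_status=$(echo "$status_output" | grep 'Leader: unknown' || true)
if [ -n "$leader_status" ]; then
  echo 'NB DB Raft leader is unknown to the cluster node.'
fi
```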

       

       

      Version-Release number of selected component (if applicable):

      OCP 4.13.5

      How reproducible:

      Unknown

      Steps to Reproduce:

      1. Install OCP 4.13.5 with OVN on OpenStack (PSI)
      2. Install RHODS 1.34 (might not be related to the root cause, but it was installed on this environment)
      3. Install the OCP Pipelines operator
      

      Actual results:

      Many applications failed with `error adding container to network "ovn-kubernetes": CNI request failed with status 400`, which prevents any operation with those applications, for example the Pipeline server:

       

      Expected results:

       

      Additional info:

      Restarting the OCP master nodes (OpenStack instances) seemed to resolve it, but in the openshift-ovn-kubernetes namespace there is still a failing pod showing:

       $ oc logs -n openshift-ovn-kubernetes ovnkube-master-bwkhl
      Defaulted container "northd" out of: northd, nbdb, kube-rbac-proxy, sbdb, ovnkube-master, ovn-dbchecker
      + [[ -f /env/_master ]]
      + trap quit TERM INT
      ++ date -Iseconds
      + echo '2023-10-22T15:20:42+00:00 - starting ovn-northd'
      2023-10-22T15:20:42+00:00 - starting ovn-northd
      + wait 7
      + exec ovn-northd --no-chdir -vconsole:info -vfile:off '-vPATTERN:console:%D{%Y-%m-%dT%H:%M:%S.###Z}|%05N|%c%T|%p|%m' --ovnnb-db ssl:192.169.1.138:9641,ssl:192.169.2.219:9641,ssl:192.169.1.91:9641 --ovnsb-db ssl:192.169.1.138:9642,ssl:192.169.2.219:9642,ssl:192.169.1.91:9642 --pidfile /var/run/ovn/ovn-northd.pid --n-threads=4 -p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt
      2023-10-22T15:20:42.793Z|00001|ovn_northd|INFO|Using 4 threads
      2023-10-22T15:20:42.793Z|00002|ovn_northd|INFO|OVN internal version is : [23.03.1-20.27.0-70.6]
      2023-10-22T15:20:42.794Z|00003|ovn_parallel_hmap|INFO|Setting thread count to 4
      2023-10-22T15:20:42.794Z|00004|ovn_parallel_hmap|INFO|Creating new pool with size 4
      2023-10-22T15:20:42.800Z|00005|reconnect|INFO|ssl:192.169.1.138:9641: connecting...
      2023-10-22T15:20:42.800Z|00006|ovn_northd|INFO|OVN NB IDL reconnected, force recompute.
      2023-10-22T15:20:42.801Z|00007|reconnect|INFO|ssl:192.169.2.219:9642: connecting...
      2023-10-22T15:20:42.801Z|00008|reconnect|INFO|ssl:192.169.2.219:9642: connection attempt failed (Connection refused)
      2023-10-22T15:20:42.801Z|00009|reconnect|INFO|ssl:192.169.1.138:9642: connecting...
      2023-10-22T15:20:42.801Z|00010|ovn_northd|INFO|OVN SB IDL reconnected, force recompute.
      2023-10-22T15:20:42.801Z|00011|reconnect|INFO|ssl:192.169.1.138:9641: connection attempt failed (Connection refused)
      2023-10-22T15:20:42.801Z|00012|reconnect|INFO|ssl:192.169.2.219:9641: connecting...
      2023-10-22T15:20:42.801Z|00013|reconnect|INFO|ssl:192.169.2.219:9641: connection attempt failed (Connection refused)
      2023-10-22T15:20:42.801Z|00014|reconnect|INFO|ssl:192.169.1.91:9641: connecting...
      2023-10-22T15:20:42.801Z|00015|reconnect|INFO|ssl:192.169.1.138:9642: connection attempt failed (Connection refused)
      2023-10-22T15:20:42.801Z|00016|reconnect|INFO|ssl:192.169.1.91:9642: connecting...
      2023-10-22T15:20:42.802Z|00017|reconnect|INFO|ssl:192.169.1.91:9642: connection attempt failed (Connection refused)
      2023-10-22T15:20:42.807Z|00018|reconnect|INFO|ssl:192.169.1.91:9641: connected
      2023-10-22T15:20:42.810Z|00019|ovsdb_cs|INFO|ssl:192.169.1.91:9641: clustered database server is disconnected from cluster; trying another server
      2023-10-22T15:20:42.810Z|00020|reconnect|INFO|ssl:192.169.1.91:9641: connection attempt timed out
      2023-10-22T15:20:43.804Z|00021|reconnect|INFO|ssl:192.169.2.219:9642: connecting...
      2023-10-22T15:20:43.804Z|00022|reconnect|INFO|ssl:192.169.2.219:9642: connection attempt failed (Connection refused)
      2023-10-22T15:20:43.804Z|00023|reconnect|INFO|ssl:192.169.2.219:9642: waiting 2 seconds before reconnect
      2023-10-22T15:20:43.810Z|00024|reconnect|INFO|ssl:192.169.1.138:9641: connecting...
      2023-10-22T15:20:43.810Z|00025|reconnect|INFO|ssl:192.169.1.138:9641: connection attempt failed (Connection refused)
      2023-10-22T15:20:43.810Z|00026|reconnect|INFO|ssl:192.169.1.138:9641: waiting 2 seconds before reconnect
      2023-10-22T15:20:45.805Z|00027|reconnect|INFO|ssl:192.169.1.138:9642: connecting...
      2023-10-22T15:20:45.805Z|00028|reconnect|INFO|ssl:192.169.1.138:9642: connection attempt failed (Connection refused)
      2023-10-22T15:20:45.805Z|00029|reconnect|INFO|ssl:192.169.1.138:9642: waiting 4 seconds before reconnect
      2023-10-22T15:20:45.810Z|00030|reconnect|INFO|ssl:192.169.2.219:9641: connecting...
      2023-10-22T15:20:45.815Z|00031|reconnect|INFO|ssl:192.169.2.219:9641: connected
      2023-10-22T15:20:45.816Z|00032|ovsdb_cs|INFO|ssl:192.169.2.219:9641: clustered database server is disconnected from cluster; trying another server
      2023-10-22T15:20:45.816Z|00033|reconnect|INFO|ssl:192.169.2.219:9641: connection attempt timed out
      2023-10-22T15:20:45.816Z|00034|reconnect|INFO|ssl:192.169.2.219:9641: waiting 4 seconds before reconnect
      2023-10-22T15:20:49.809Z|00035|reconnect|INFO|ssl:192.169.1.91:9642: connecting...
      2023-10-22T15:20:49.809Z|00036|reconnect|INFO|ssl:192.169.1.91:9642: connection attempt failed (Connection refused)
      2023-10-22T15:20:49.809Z|00037|reconnect|INFO|ssl:192.169.1.91:9642: continuing to reconnect in the background but suppressing further logging
      2023-10-22T15:20:49.817Z|00038|reconnect|INFO|ssl:192.169.1.91:9641: connecting...
      2023-10-22T15:20:49.822Z|00039|reconnect|INFO|ssl:192.169.1.91:9641: connected
      2023-10-22T15:20:49.824Z|00040|ovsdb_cs|INFO|ssl:192.169.1.91:9641: clustered database server is disconnected from cluster; trying another server
      2023-10-22T15:20:49.824Z|00041|reconnect|INFO|ssl:192.169.1.91:9641: connection attempt timed out
      2023-10-22T15:20:49.824Z|00042|reconnect|INFO|ssl:192.169.1.91:9641: continuing to reconnect in the background but suppressing further logging
      2023-10-22T15:20:57.818Z|00043|memory|INFO|12336 kB peak resident set size after 15.0 seconds
      2023-10-22T15:21:05.830Z|00044|reconnect|INFO|ssl:192.169.2.219:9641: connected
      2023-10-22T15:21:05.832Z|00045|ovsdb_cs|INFO|ssl:192.169.2.219:9641: clustered database server is disconnected from cluster; trying another server
      2023-10-22T15:21:13.839Z|00046|reconnect|INFO|ssl:192.169.1.91:9641: connected
      2023-10-22T15:21:13.842Z|00047|ovsdb_cs|INFO|ssl:192.169.1.91:9641: clustered database server is disconnected from cluster; trying another server
      2023-10-22T15:21:21.848Z|00048|reconnect|INFO|ssl:192.169.1.138:9641: connected
      2023-10-22T15:21:21.850Z|00049|ovsdb_cs|INFO|ssl:192.169.1.138:9641: clustered database server is disconnected from cluster; trying another server
      2023-10-22T15:21:29.856Z|00050|reconnect|INFO|ssl:192.169.2.219:9641: connected
      2023-10-22T15:21:29.859Z|00051|ovsdb_cs|INFO|ssl:192.169.2.219:9641: clustered database server is disconnected from cluster; trying another server
      2023-10-22T15:21:37.865Z|00052|reconnect|INFO|ssl:192.169.1.91:9642: connected
      2023-10-22T15:21:37.867Z|00053|reconnect|INFO|ssl:192.169.1.91:9641: connected
      2023-10-22T15:21:37.867Z|00054|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
      2023-10-22T15:21:37.867Z|00055|ovsdb_cs|INFO|ssl:192.169.1.91:9642: clustered database server is disconnected from cluster; trying another server
      2023-10-22T15:21:37.867Z|00056|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
      2023-10-22T15:21:37.870Z|00057|ovsdb_cs|INFO|ssl:192.169.1.91:9641: clustered database server is not cluster leader; trying another server
      2023-10-22T15:21:45.878Z|00058|reconnect|INFO|ssl:192.169.2.219:9642: connected
      2023-10-22T15:21:45.879Z|00059|reconnect|INFO|ssl:192.169.1.138:9641: connected
      2023-10-22T15:21:45.880Z|00060|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
      2023-10-22T15:21:45.880Z|00061|ovsdb_cs|INFO|ssl:192.169.2.219:9642: clustered database server is disconnected from cluster; trying another server
      2023-10-22T15:21:45.880Z|00062|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
      2023-10-22T15:21:53.893Z|00063|reconnect|INFO|ssl:192.169.1.138:9642: connected
      2023-10-22T15:21:53.895Z|00064|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
      2023-10-22T15:21:53.896Z|00065|ovsdb_cs|INFO|ssl:192.169.1.138:9642: clustered database server is disconnected from cluster; trying another server
      2023-10-22T15:21:53.896Z|00066|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
      2023-10-22T15:22:01.906Z|00067|reconnect|INFO|ssl:192.169.1.91:9642: connected
      2023-10-22T15:22:01.908Z|00068|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
      2023-10-22T15:22:01.908Z|00069|ovsdb_cs|INFO|ssl:192.169.1.91:9642: clustered database server is disconnected from cluster; trying another server
      2023-10-22T15:22:01.908Z|00070|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
      2023-10-22T15:22:09.914Z|00071|reconnect|INFO|ssl:192.169.2.219:9642: connected
      2023-10-22T15:22:09.915Z|00072|ovn_northd|INFO|ovn-northd lock acquired. This ovn-northd instance is now active.
      2023-10-22T15:22:09.916Z|00073|ovsdb_cs|INFO|ssl:192.169.2.219:9642: clustered database server is disconnected from cluster; trying another server
      2023-10-22T15:22:09.916Z|00074|ovn_northd|INFO|ovn-northd lock lost. This ovn-northd instance is now on standby.
      2023-10-22T15:22:17.922Z|00075|reconnect|INFO|ssl:192.169.1.138:9642: connected
      2023-10-22T15:22:26.696Z|00076|memory|INFO|peak resident set size grew 120% in last 88.9 seconds, from 12336 kB to 27112 kB
      2023-10-22T15:22:26.703Z|00077|memory|INFO|idl-cells-OVN_Northbound:7559 idl-cells-OVN_Southbound:32114
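One way to triage a log like the one above is to count, per DB endpoint, how often northd reported it disconnected from the cluster; a small sketch over three sample lines copied from the log (runs locally, no cluster access needed):

```shell
# Three sample lines copied from the northd log above.
log='2023-10-22T15:21:05.832Z|00045|ovsdb_cs|INFO|ssl:192.169.2.219:9641: clustered database server is disconnected from cluster; trying another server
2023-10-22T15:21:13.842Z|00047|ovsdb_cs|INFO|ssl:192.169.1.91:9641: clustered database server is disconnected from cluster; trying another server
2023-10-22T15:21:21.850Z|00049|ovsdb_cs|INFO|ssl:192.169.1.138:9641: clustered database server is disconnected from cluster; trying another server'

# Extract the "ssl:<ip>:<port>" endpoint from each match and count
# occurrences per endpoint. Here every NB member reports disconnected,
# which points at a cluster-wide Raft problem rather than one bad node.
counts=$(echo "$log" | grep 'disconnected from cluster' |
  sed 's/.*|INFO|\(ssl:[^:]*:[0-9]*\):.*/\1/' | sort | uniq -c)
echo "$counts"
```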

       

       

              Jaime Caamaño Ruiz (jcaamano@redhat.com)
              Noam Manos (nmanos@redhat.com)
              Anurag Saxena