OpenShift Bugs / OCPBUGS-19451

[OCP 4.12] failed to configure pod interface: timed out waiting for OVS port binding in a new installation


    • Important
    • No
    • SDN Sprint 242, SDN Sprint 243, SDN Sprint 244
    • 3
    • Rejected
    • False

      Description of problem:

      A new installation took almost 8 hours to complete; the main blocker appears to be the "installer" pods for "kube-apiserver" on "master0", which fail with this error:
      
      2023-09-18T00:47:36.278647573Z W0918 00:47:36.278458  162649 helper_linux.go:390] [openshift-kube-apiserver/revision-pruner-7-xxxxxxxxxxxxxxxxxxxxxxxxxxxxx 5567be31134b88d449c582672f63864103ccfdc45d1aca01ae92d28655b7398b] pod uid dcd66366-7f39-4cdc-b007-7a604e47409e: timed out waiting for OVS port binding (ovn-installed) for 0a:0b:0c:0d:0e:0f [172.xx.yy.zz/23]
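      
      A quick way to confirm this symptom on the node is to check whether ovn-controller ever marked the pod's OVS interface as installed. A minimal diagnostic sketch (the pod name in the log above is redacted, so <pod-name> is a placeholder; run from a debug shell on master0, e.g. `oc debug node/master0` followed by `chroot /host`):
      ~~~
      # List the OVS interface created for the pod; ovn-kubernetes names the
      # logical port "<namespace>_<pod-name>". A successfully bound port carries
      # external_ids:ovn-installed="true"; its absence matches the
      # "timed out waiting for OVS port binding (ovn-installed)" error above.
      ovs-vsctl --columns=name,external_ids find Interface \
          external_ids:iface-id=openshift-kube-apiserver_<pod-name>
      ~~~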
      
      

      Version-Release number of selected component (if applicable):

      OCP 4.12.31 - Assisted Installer
      3 masters (schedulable as workers), 4 workers

      How reproducible:

      Happens automatically in around 5% of deployments

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

      The installation is halted for several hours until it manages to self-heal and continue.

      Expected results:

      A reasonably quick installation.

      Additional info:

      This message kept repeating from "2023-09-18T00:50:55" until "2023-09-18T06:27:41".
      During the same period, the "ovn-controller" container was unable to connect to the Southbound DB server:
      ~~~
      2023-09-17T23:54:52.860817846Z 2023-09-17T23:54:52.860Z|00022|ovsdb_cs|INFO|ssl:10.x.y.z:9642: clustered database server is disconnected from cluster; trying another server
      ~~~
      
      and a little later the database was marked as "stale" and remained that way:
      ~~~
      2023-09-18T00:41:41.530734555Z 2023-09-18T00:41:41.530Z|00120|ovsdb_cs|WARN|ssl:10.x.y.z:9642: clustered database server has stale data; trying another server
      ~~~
      
      until it managed to recover (almost 6 hours later):
      ~~~
      2023-09-18T06:27:39.021217311Z 2023-09-18T06:27:39.021Z|05271|reconnect|INFO|ssl:10.x.y.z:9642: connected
      ~~~
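      
      To see why the client kept bouncing between servers, the Raft status of the southbound database can be inspected directly. A sketch, assuming the usual OCP 4.12 layout where the SB database runs in the ovnkube-master pods in a container named "sbdb" with its control socket at /var/run/ovn/ovnsb_db.ctl (pod name is a placeholder; adjust to the actual cluster):
      ~~~
      # Show Raft role, term and connection state of each SB DB member; a member
      # that is disconnected from the cluster or lagging behind would explain the
      # "disconnected from cluster" / "stale data" messages above.
      oc -n openshift-ovn-kubernetes exec ovnkube-master-<suffix> -c sbdb -- \
          ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound
      ~~~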
      
      The network ClusterOperator reported problems during the deployment:
      ~~~
      2023-09-18T00:35:28.817074935Z I0918 00:35:28.816913       1 log.go:198] Set ClusterOperator conditions:
      2023-09-18T00:35:28.817074935Z - lastTransitionTime: "2023-09-18T00:14:48Z"
      2023-09-18T00:35:28.817074935Z   message: |-
      2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2023-09-18T00:04:09Z
      2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-mk4pw is in CrashLoopBackOff State
      2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-p8sth is in CrashLoopBackOff State
      2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-ppz22 is in CrashLoopBackOff State
      2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - last change 2023-09-17T23:57:08Z
      2023-09-18T00:35:28.817074935Z   reason: RolloutHung
      2023-09-18T00:35:28.817074935Z   status: "True"
      2023-09-18T00:35:28.817074935Z   type: Degraded
      2023-09-18T00:35:28.817074935Z - lastTransitionTime: "2023-09-17T23:17:44Z"
      2023-09-18T00:35:28.817074935Z   status: "False"
      2023-09-18T00:35:28.817074935Z   type: ManagementStateDegraded
      2023-09-18T00:35:28.817074935Z - lastTransitionTime: "2023-09-17T23:17:44Z"
      2023-09-18T00:35:28.817074935Z   status: "True"
      2023-09-18T00:35:28.817074935Z   type: Upgradeable
      2023-09-18T00:35:28.817074935Z - lastTransitionTime: "2023-09-17T23:54:22Z"
      2023-09-18T00:35:28.817074935Z   message: |-
      2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 2 nodes)
      2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" is not available (awaiting 3 nodes)
      2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
      2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 2 nodes)
      2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" update is rolling out (3 out of 7 updated)
      2023-09-18T00:35:28.817074935Z   reason: Deploying
      2023-09-18T00:35:28.817074935Z   status: "True"
      2023-09-18T00:35:28.817074935Z   type: Progressing
      2023-09-18T00:35:28.817074935Z - lastTransitionTime: "2023-09-17T23:26:57Z"
      2023-09-18T00:35:28.817074935Z   status: "True"
      2023-09-18T00:35:28.817074935Z   type: Available
      ~~~
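      
      The same conditions can be read at any time from the network ClusterOperator itself, which is usually quicker than digging through the operator logs:
      ~~~
      # Summarized status of the network operator.
      oc get clusteroperator network
      # Full condition list (Degraded/Progressing/Available), matching the
      # "Set ClusterOperator conditions" log entries shown above.
      oc get clusteroperator network -o yaml
      ~~~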
      
      A few minutes later it recovered:
      ~~~
      2023-09-18T00:46:44.256028807Z I0918 00:46:44.255888       1 log.go:198] Set ClusterOperator conditions:
      2023-09-18T00:46:44.256028807Z - lastTransitionTime: "2023-09-18T00:46:44Z"
      2023-09-18T00:46:44.256028807Z   status: "False"
      2023-09-18T00:46:44.256028807Z   type: Degraded
      2023-09-18T00:46:44.256028807Z - lastTransitionTime: "2023-09-17T23:17:44Z"
      2023-09-18T00:46:44.256028807Z   status: "False"
      2023-09-18T00:46:44.256028807Z   type: ManagementStateDegraded
      2023-09-18T00:46:44.256028807Z - lastTransitionTime: "2023-09-17T23:17:44Z"
      2023-09-18T00:46:44.256028807Z   status: "True"
      2023-09-18T00:46:44.256028807Z   type: Upgradeable
      2023-09-18T00:46:44.256028807Z - lastTransitionTime: "2023-09-18T00:46:44Z"
      2023-09-18T00:46:44.256028807Z   status: "False"
      2023-09-18T00:46:44.256028807Z   type: Progressing
      2023-09-18T00:46:44.256028807Z - lastTransitionTime: "2023-09-17T23:26:57Z"
      2023-09-18T00:46:44.256028807Z   status: "True"
      2023-09-18T00:46:44.256028807Z   type: Available
      ~~~
      
      Even though the cluster operator looks healthy at this point, it seems that something happened on master0 that prevented ovn-controller from connecting to the southbound database ("sbdb") and hence from correctly managing pod networking.
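      
      One way to verify that hypothesis on a live cluster is to ask ovn-controller on master0 for its view of the southbound connection. A sketch, assuming the OCP 4.12 ovnkube-node pod layout, where ovn-controller runs in a container named "ovn-controller" (pod name is a placeholder):
      ~~~
      # Report whether ovn-controller is currently connected to the SB DB.
      oc -n openshift-ovn-kubernetes exec ovnkube-node-<suffix> -c ovn-controller -- \
          ovn-appctl -t ovn-controller connection-status
      # The ovn-controller log (as quoted above) keeps the reconnect history.
      oc -n openshift-ovn-kubernetes logs ovnkube-node-<suffix> -c ovn-controller | grep -E 'ovsdb_cs|reconnect'
      ~~~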

       

              Riccardo Ravaioli (rravaiol@redhat.com)
              Mario Abajo Duran (rhn-support-mabajodu)
              Anurag Saxena
              Votes: 0
              Watchers: 6
