Bug
Resolution: Cannot Reproduce
Critical
None
4.12.z
Description of problem:
A new installation took almost 8 hours to complete. The main blocker appears to be the "installer" pods for "kube-apiserver" on "master0", which fail with this error:
~~~
2023-09-18T00:47:36.278647573Z W0918 00:47:36.278458 162649 helper_linux.go:390] [openshift-kube-apiserver/revision-pruner-7-xxxxxxxxxxxxxxxxxxxxxxxxxxxxx 5567be31134b88d449c582672f63864103ccfdc45d1aca01ae92d28655b7398b] pod uid dcd66366-7f39-4cdc-b007-7a604e47409e: timed out waiting for OVS port binding (ovn-installed) for 0a:0b:0c:0d:0e:0f [172.xx.yy.zz/23]
~~~
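For reference, the stuck pods and the missing "ovn-installed" marker can be inspected with standard oc and ovs-vsctl commands. This is only a rough diagnostic sketch; the pod and node names below are illustrative, not taken from this cluster:
~~~
# Installer / revision-pruner pods scheduled on master0 (pod name is illustrative)
oc -n openshift-kube-apiserver get pods -o wide | grep master0

# The pod events normally repeat the "timed out waiting for OVS port binding (ovn-installed)" message
oc -n openshift-kube-apiserver describe pod revision-pruner-7-master0

# ovn-controller marks a completed binding on the pod's OVS interface with external_ids:ovn-installed="true"
oc debug node/master0 -- chroot /host ovs-vsctl --columns=name,external_ids list Interface
~~~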
Version-Release number of selected component (if applicable):
OCP 4.12.31 - Assisted Installer, 3 masters (schedulable as workers), 4 workers
How reproducible:
Happens on its own in roughly 5% of deployments
Steps to Reproduce:
1. Install an OCP 4.12.31 cluster (3 masters, 4 workers) with the Assisted Installer.
2. Monitor the kube-apiserver installer and revision-pruner pods on the masters during the installation.
3. In roughly 5% of deployments, the pods time out waiting for the OVS port binding and the installation stalls.
Actual results:
The installation stalls for several hours until the cluster self-heals and the install continues.
Expected results:
A reasonably quick installation.
Additional info:
This message kept repeating from "2023-09-18T00:50:55" until "2023-09-18T06:27:41". During the same period, the "ovn-controller" container was unable to connect to the Southbound DB server:
~~~
2023-09-17T23:54:52.860817846Z 2023-09-17T23:54:52.860Z|00022|ovsdb_cs|INFO|ssl:10.x.y.z:9642: clustered database server is disconnected from cluster; trying another server
~~~
A little later the database was marked as "stale" and stayed that way:
~~~
2023-09-18T00:41:41.530734555Z 2023-09-18T00:41:41.530Z|00120|ovsdb_cs|WARN|ssl:10.x.y.z:9642: clustered database server has stale data; trying another server
~~~
until it finally recovered, almost 6 hours later:
~~~
2023-09-18T06:27:39.021217311Z 2023-09-18T06:27:39.021Z|05271|reconnect|INFO|ssl:10.x.y.z:9642: connected
~~~
The network cluster operator was reporting problems during the deployment:
~~~
2023-09-18T00:35:28.817074935Z I0918 00:35:28.816913 1 log.go:198] Set ClusterOperator conditions:
2023-09-18T00:35:28.817074935Z - lastTransitionTime: "2023-09-18T00:14:48Z"
2023-09-18T00:35:28.817074935Z   message: |-
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2023-09-18T00:04:09Z
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-mk4pw is in CrashLoopBackOff State
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-p8sth is in CrashLoopBackOff State
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-ppz22 is in CrashLoopBackOff State
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - last change 2023-09-17T23:57:08Z
2023-09-18T00:35:28.817074935Z   reason: RolloutHung
2023-09-18T00:35:28.817074935Z   status: "True"
2023-09-18T00:35:28.817074935Z   type: Degraded
2023-09-18T00:35:28.817074935Z - lastTransitionTime: "2023-09-17T23:17:44Z"
2023-09-18T00:35:28.817074935Z   status: "False"
2023-09-18T00:35:28.817074935Z   type: ManagementStateDegraded
2023-09-18T00:35:28.817074935Z - lastTransitionTime: "2023-09-17T23:17:44Z"
2023-09-18T00:35:28.817074935Z   status: "True"
2023-09-18T00:35:28.817074935Z   type: Upgradeable
2023-09-18T00:35:28.817074935Z - lastTransitionTime: "2023-09-17T23:54:22Z"
2023-09-18T00:35:28.817074935Z   message: |-
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 2 nodes)
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" is not available (awaiting 3 nodes)
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 2 nodes)
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" update is rolling out (3 out of 7 updated)
2023-09-18T00:35:28.817074935Z   reason: Deploying
2023-09-18T00:35:28.817074935Z   status: "True"
2023-09-18T00:35:28.817074935Z   type: Progressing
2023-09-18T00:35:28.817074935Z - lastTransitionTime: "2023-09-17T23:26:57Z"
2023-09-18T00:35:28.817074935Z   status: "True"
2023-09-18T00:35:28.817074935Z   type: Available
~~~
A few minutes later it recovered:
~~~
2023-09-18T00:46:44.256028807Z I0918 00:46:44.255888 1 log.go:198] Set ClusterOperator conditions:
2023-09-18T00:46:44.256028807Z - lastTransitionTime: "2023-09-18T00:46:44Z"
2023-09-18T00:46:44.256028807Z   status: "False"
2023-09-18T00:46:44.256028807Z   type: Degraded
2023-09-18T00:46:44.256028807Z - lastTransitionTime: "2023-09-17T23:17:44Z"
2023-09-18T00:46:44.256028807Z   status: "False"
2023-09-18T00:46:44.256028807Z   type: ManagementStateDegraded
2023-09-18T00:46:44.256028807Z - lastTransitionTime: "2023-09-17T23:17:44Z"
2023-09-18T00:46:44.256028807Z   status: "True"
2023-09-18T00:46:44.256028807Z   type: Upgradeable
2023-09-18T00:46:44.256028807Z - lastTransitionTime: "2023-09-18T00:46:44Z"
2023-09-18T00:46:44.256028807Z   status: "False"
2023-09-18T00:46:44.256028807Z   type: Progressing
2023-09-18T00:46:44.256028807Z - lastTransitionTime: "2023-09-17T23:26:57Z"
2023-09-18T00:46:44.256028807Z   status: "True"
2023-09-18T00:46:44.256028807Z   type: Available
~~~
Even though the cluster operator looks healthy at this point, it seems something happened on master0 that prevented ovn-controller from connecting to the "sbdb" and therefore from setting up pod networking correctly.
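As a rough sketch of how this could be narrowed down further, the ovn-controller connection state and the Southbound DB raft health can be queried from the relevant pods. The pod names and the control-socket path below are assumptions and may differ between releases:
~~~
# ovn-controller's own view of its Southbound DB connection, from the node's ovnkube-node pod
oc -n openshift-ovn-kubernetes exec <ovnkube-node-pod-on-master0> -c ovn-controller -- \
  ovn-appctl -t ovn-controller connection-status

# Raft status of the Southbound DB cluster (control-socket path is an assumption)
oc -n openshift-ovn-kubernetes exec <ovnkube-master-pod> -c sbdb -- \
  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound

# Operator-level view of the hung rollout
oc get co network -o yaml
~~~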