Bug
Resolution: Cannot Reproduce
Critical
None
4.12.z
Description of problem:
A new installation took almost 8 hours to complete. The main blocker appears to be the "installer" pods for "kube-apiserver" on "master0", which fail with this error:
~~~
2023-09-18T00:47:36.278647573Z W0918 00:47:36.278458 162649 helper_linux.go:390] [openshift-kube-apiserver/revision-pruner-7-xxxxxxxxxxxxxxxxxxxxxxxxxxxxx 5567be31134b88d449c582672f63864103ccfdc45d1aca01ae92d28655b7398b] pod uid dcd66366-7f39-4cdc-b007-7a604e47409e: timed out waiting for OVS port binding (ovn-installed) for 0a:0b:0c:0d:0e:0f [172.xx.yy.zz/23]
~~~
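For reference, the stuck pods and the missing "ovn-installed" marker can be inspected with standard oc and ovs-vsctl commands. This is only a rough diagnostic sketch; the pod and node names below are illustrative, not taken from this cluster:
~~~
# Installer / revision-pruner pods scheduled on master0 (pod name is illustrative)
oc -n openshift-kube-apiserver get pods -o wide | grep master0

# The pod events normally repeat the "timed out waiting for OVS port binding (ovn-installed)" message
oc -n openshift-kube-apiserver describe pod revision-pruner-7-master0

# ovn-controller marks a completed binding on the pod's OVS interface with external_ids:ovn-installed="true"
oc debug node/master0 -- chroot /host ovs-vsctl --columns=name,external_ids list Interface
~~~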
Version-Release number of selected component (if applicable):
OCP 4.12.31 - Assisted Installer, 3 masters (schedulable as workers), 4 workers
How reproducible:
Happens on its own in roughly 5% of deployments
Steps to Reproduce:
1. Install an OCP 4.12.31 cluster (3 masters, 4 workers) with the Assisted Installer.
2. Monitor the kube-apiserver installer and revision-pruner pods on the masters during the installation.
3. In roughly 5% of deployments, the pods time out waiting for the OVS port binding and the installation stalls.
Actual results:
The installation stalls for several hours until the cluster self-heals and the install continues.
Expected results:
A reasonably quick installation.
Additional info:
This message kept repeating from "2023-09-18T00:50:55" until "2023-09-18T06:27:41". During the same period, the "ovn-controller" container was unable to connect to the Southbound DB server:
~~~
2023-09-17T23:54:52.860817846Z 2023-09-17T23:54:52.860Z|00022|ovsdb_cs|INFO|ssl:10.x.y.z:9642: clustered database server is disconnected from cluster; trying another server
~~~
A little later the database was marked as "stale" and stayed that way:
~~~
2023-09-18T00:41:41.530734555Z 2023-09-18T00:41:41.530Z|00120|ovsdb_cs|WARN|ssl:10.x.y.z:9642: clustered database server has stale data; trying another server
~~~
until it finally recovered, almost 6 hours later:
~~~
2023-09-18T06:27:39.021217311Z 2023-09-18T06:27:39.021Z|05271|reconnect|INFO|ssl:10.x.y.z:9642: connected
~~~
The network cluster operator was reporting problems during the deployment:
~~~
2023-09-18T00:35:28.817074935Z I0918 00:35:28.816913 1 log.go:198] Set ClusterOperator conditions:
2023-09-18T00:35:28.817074935Z - lastTransitionTime: "2023-09-18T00:14:48Z"
2023-09-18T00:35:28.817074935Z   message: |-
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2023-09-18T00:04:09Z
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-mk4pw is in CrashLoopBackOff State
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-p8sth is in CrashLoopBackOff State
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - pod ovnkube-master-ppz22 is in CrashLoopBackOff State
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" rollout is not making progress - last change 2023-09-17T23:57:08Z
2023-09-18T00:35:28.817074935Z   reason: RolloutHung
2023-09-18T00:35:28.817074935Z   status: "True"
2023-09-18T00:35:28.817074935Z   type: Degraded
2023-09-18T00:35:28.817074935Z - lastTransitionTime: "2023-09-17T23:17:44Z"
2023-09-18T00:35:28.817074935Z   status: "False"
2023-09-18T00:35:28.817074935Z   type: ManagementStateDegraded
2023-09-18T00:35:28.817074935Z - lastTransitionTime: "2023-09-17T23:17:44Z"
2023-09-18T00:35:28.817074935Z   status: "True"
2023-09-18T00:35:28.817074935Z   type: Upgradeable
2023-09-18T00:35:28.817074935Z - lastTransitionTime: "2023-09-17T23:54:22Z"
2023-09-18T00:35:28.817074935Z   message: |-
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 2 nodes)
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-master" is not available (awaiting 3 nodes)
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-multus/multus" is not available (awaiting 1 nodes)
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 2 nodes)
2023-09-18T00:35:28.817074935Z     DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" update is rolling out (3 out of 7 updated)
2023-09-18T00:35:28.817074935Z   reason: Deploying
2023-09-18T00:35:28.817074935Z   status: "True"
2023-09-18T00:35:28.817074935Z   type: Progressing
2023-09-18T00:35:28.817074935Z - lastTransitionTime: "2023-09-17T23:26:57Z"
2023-09-18T00:35:28.817074935Z   status: "True"
2023-09-18T00:35:28.817074935Z   type: Available
~~~
A few minutes later it recovered:
~~~
2023-09-18T00:46:44.256028807Z I0918 00:46:44.255888 1 log.go:198] Set ClusterOperator conditions:
2023-09-18T00:46:44.256028807Z - lastTransitionTime: "2023-09-18T00:46:44Z"
2023-09-18T00:46:44.256028807Z   status: "False"
2023-09-18T00:46:44.256028807Z   type: Degraded
2023-09-18T00:46:44.256028807Z - lastTransitionTime: "2023-09-17T23:17:44Z"
2023-09-18T00:46:44.256028807Z   status: "False"
2023-09-18T00:46:44.256028807Z   type: ManagementStateDegraded
2023-09-18T00:46:44.256028807Z - lastTransitionTime: "2023-09-17T23:17:44Z"
2023-09-18T00:46:44.256028807Z   status: "True"
2023-09-18T00:46:44.256028807Z   type: Upgradeable
2023-09-18T00:46:44.256028807Z - lastTransitionTime: "2023-09-18T00:46:44Z"
2023-09-18T00:46:44.256028807Z   status: "False"
2023-09-18T00:46:44.256028807Z   type: Progressing
2023-09-18T00:46:44.256028807Z - lastTransitionTime: "2023-09-17T23:26:57Z"
2023-09-18T00:46:44.256028807Z   status: "True"
2023-09-18T00:46:44.256028807Z   type: Available
~~~
Even though the cluster operator looks healthy at this point, it seems something happened on master0 that prevented ovn-controller from connecting to the "sbdb" and therefore from setting up pod networking correctly.
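As a rough sketch of how this could be narrowed down further, the ovn-controller connection state and the Southbound DB raft health can be queried from the relevant pods. The pod names and the control-socket path below are assumptions and may differ between releases:
~~~
# ovn-controller's own view of its Southbound DB connection, from the node's ovnkube-node pod
oc -n openshift-ovn-kubernetes exec <ovnkube-node-pod-on-master0> -c ovn-controller -- \
  ovn-appctl -t ovn-controller connection-status

# Raft status of the Southbound DB cluster (control-socket path is an assumption)
oc -n openshift-ovn-kubernetes exec <ovnkube-master-pod> -c sbdb -- \
  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound

# Operator-level view of the hung rollout
oc get co network -o yaml
~~~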