-
Bug
-
Resolution: Duplicate
-
Critical
-
None
-
4.12.z
-
Critical
-
No
-
SDN Sprint 235, SDN Sprint 236, SDN Sprint 237, SDN Sprint 238, SDN Sprint 239
-
5
-
Rejected
-
False
-
-
Customer Escalated
-
We were on a call with mshen.openshift and others, and what we saw was:
- the management cluster was upgrading to 4.12.10
- the worker nodes were getting drained and rebooted
- the ovnkube-master pods that were on these worker nodes (which manage another guest cluster) entered a CLBO state
- they provided an inspect of the pods, and in the inspect logs I saw:
2023-04-13T19:00:43.627758417Z F0413 19:00:43.627737 1 ovnkube.go:133] error when trying to initialize libovsdb NB client: unable to connect to any endpoints: failed to connect to ssl:ovnkube-master-0.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local:9641: failed to open connection: dial tcp 10.128.26.57:9641: i/o timeout. failed to connect to ssl:ovnkube-master-1.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local:9641: failed to open connection: dial tcp: lookup ovnkube-master-1.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local: i/o timeout. failed to connect to ssl:ovnkube-master-2.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local:9641: failed to open connection: dial tcp: lookup ovnkube-master-2.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local: i/o timeout
- ovnkube-master was therefore unable to talk to nbdb
- looking at the nbdb logs, we saw the following errors:
2023-04-13T16:26:47.859231099Z 2023-04-13T16:26:47.859Z|146139|dns_resolve|WARN|Dropped 5 log messages in last 56 seconds (most recently, 12 seconds ago) due to excessive rate
2023-04-13T16:26:47.859269067Z 2023-04-13T16:26:47.859Z|146140|dns_resolve|WARN|ovnkube-master-2.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local: failed to resolve
2023-04-13T16:26:47.859295660Z 2023-04-13T16:26:47.859Z|146141|stream_ssl|ERR|ssl:ovnkube-master-2.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local:9643: connect: Address family not supported by protocol
- for some reason nbdb was unable to resolve the IPs of those domains (see the diagnostic sketch after this list)
- initially when I joined the call, master-0 was healthy while master-1 and master-2 were in CLBO. We restarted master-1 and it started talking to master-2 but not master-0; then we restarted master-2, which started talking to master-1 but not master-0; finally we had to restart master-0, after which all three could talk to each other. {NOTE that initially the master-1 and master-2 pods were on nodes that had already been drained and rebooted, while master-0's upgrade had not yet finished}.
- The masters became healthy; the incident is resolved as of now.
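For the follow-up analysis, note that the fatal log above actually shows two distinct failure modes: master-0's name did resolve (the TCP dial to 10.128.26.57:9641 timed out), while the lookups for master-1 and master-2 themselves timed out. A minimal diagnostic sketch, using only the Go standard library (this is not ovn-kubernetes or libovsdb code; the 5-second timeouts are arbitrary), that tells the two modes apart per endpoint:

```go
// Diagnostic sketch (stdlib only; not ovn-kubernetes/libovsdb code): for each
// NB endpoint, first try the DNS lookup, then a raw TCP dial to the nbdb port,
// so the two failure modes seen in the fatal log can be told apart.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Endpoint names copied verbatim from the inspect log above.
	base := "ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local"
	for _, pod := range []string{"ovnkube-master-0", "ovnkube-master-1", "ovnkube-master-2"} {
		host := pod + "." + base
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		addrs, err := net.DefaultResolver.LookupHost(ctx, host)
		cancel()
		if err != nil {
			// Matches "lookup ...: i/o timeout" (master-1, master-2).
			fmt.Printf("%s: DNS lookup failed: %v\n", host, err)
			continue
		}
		// DNS worked; check raw TCP reachability of the nbdb port 9641.
		conn, err := net.DialTimeout("tcp", net.JoinHostPort(addrs[0], "9641"), 5*time.Second)
		if err != nil {
			// Matches "dial tcp 10.128.26.57:9641: i/o timeout" (master-0).
			fmt.Printf("%s (%s): TCP dial failed: %v\n", host, addrs[0], err)
			continue
		}
		conn.Close()
		fmt.Printf("%s (%s): reachable\n", host, addrs[0])
	}
}
```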
I thought this was a bug we had fixed in the past: https://bugzilla.redhat.com/show_bug.cgi?id=2103590 . I will follow up on the details and talk to Patryk.
ACTION items for the SDN team as follow-up:
- Figure out why the domains couldn't be resolved; according to the fix https://github.com/openshift/cluster-network-operator/pull/1503/files, DNS should start to work once the pod gets its status.PodIP set.
- Figure out, using the provided must-gather and inspect logs, why restarts fixed the failing resolution/connections {does nbdb give up retrying after a certain point? see the illustrative sketch below}
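On that second question, a purely hypothetical illustration {we have NOT confirmed that nbdb/ovsdb-server's reconnect logic behaves this way}: if a client retries resolution with backoff but caps the number of attempts, it stays broken even after DNS recovers, and only a process restart makes it resolve again, which would match the restart behavior we observed on the call:

```go
// Hypothetical illustration only {we have NOT confirmed that nbdb /
// ovsdb-server behaves this way}: a client that retries resolution with
// backoff but gives up after maxAttempts stays broken even once DNS
// recovers, and only a process restart makes it resolve again.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func resolveWithRetry(host string, maxAttempts int) ([]string, error) {
	backoff := time.Second
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		addrs, err := net.DefaultResolver.LookupHost(ctx, host)
		cancel()
		if err == nil {
			return addrs, nil
		}
		lastErr = err
		fmt.Printf("attempt %d/%d failed: %v\n", attempt, maxAttempts, err)
		time.Sleep(backoff)
		backoff *= 2 // exponential backoff between attempts
	}
	// Past this point the caller never retries: the analogue of a daemon
	// that would need a restart to recover after DNS comes back.
	return nil, fmt.Errorf("giving up on %s after %d attempts: %w", host, maxAttempts, lastErr)
}

func main() {
	// Placeholder hostname; the real names are in the logs above.
	if _, err := resolveWithRetry("ovnkube-master-1.example.svc.cluster.local", 5); err != nil {
		fmt.Println(err)
	}
}
```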
ACTION items for SD team:
- they said they feel confident they can reproduce this; a recipe or a reproducer would be great, as we could then reproduce it in a similar manner on our side
- they will open an eng bug for this.
- is related to: HOSTEDCP-961 Create e2e to test management cluster node upgrade while running a hosted cluster (To Do)