-
Bug
-
Resolution: Duplicate
-
Critical
-
None
-
4.12.z
-
Critical
-
No
-
SDN Sprint 235, SDN Sprint 236, SDN Sprint 237, SDN Sprint 238, SDN Sprint 239
-
5
-
Rejected
-
False
-
-
Customer Escalated
-
We were on a call with mshen.openshift and others, and what we saw was:
- the management cluster was upgrading to 4.12.10
- the worker nodes were getting drained and rebooted
- the ovnkube-master pods that were on these worker nodes (which manage another guest cluster) entered a CLBO state
- they provided an inspect of the pods, and in the inspect logs I saw:
2023-04-13T19:00:43.627758417Z F0413 19:00:43.627737 1 ovnkube.go:133] error when trying to initialize libovsdb NB client: unable to connect to any endpoints: failed to connect to ssl:ovnkube-master-0.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local:9641: failed to open connection: dial tcp 10.128.26.57:9641: i/o timeout. failed to connect to ssl:ovnkube-master-1.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local:9641: failed to open connection: dial tcp: lookup ovnkube-master-1.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local: i/o timeout. failed to connect to ssl:ovnkube-master-2.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local:9641: failed to open connection: dial tcp: lookup ovnkube-master-2.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local: i/o timeout
- ovnkube-master was therefore unable to talk to nbdb
- looking at the nbdb logs, we saw the following errors:
2023-04-13T16:26:47.859231099Z 2023-04-13T16:26:47.859Z|146139|dns_resolve|WARN|Dropped 5 log messages in last 56 seconds (most recently, 12 seconds ago) due to excessive rate
2023-04-13T16:26:47.859269067Z 2023-04-13T16:26:47.859Z|146140|dns_resolve|WARN|ovnkube-master-2.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local: failed to resolve
2023-04-13T16:26:47.859295660Z 2023-04-13T16:26:47.859Z|146141|stream_ssl|ERR|ssl:ovnkube-master-2.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local:9643: connect: Address family not supported by protocol
- for some reason nbdb was unable to resolve the IPs of those domains (see the diagnostic sketch after this list)
- initially when I joined the call, master-0 was healthy while master-1 and master-2 were in CLBO. We restarted master-1 and it started talking to master-2 but not master-0; then we restarted master-2, which started talking to master-1 but not master-0; finally we had to restart master-0, after which all three could talk to each other. {NOTE that initially the master-1 and master-2 pods were on nodes that had already been drained and rebooted, while master-0's upgrade had not yet finished}.
- The masters became healthy; the incident is resolved as of now.
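For the follow-up analysis, note that the fatal log above actually shows two distinct failure modes: master-0's name did resolve (the TCP dial to 10.128.26.57:9641 timed out), while the lookups for master-1 and master-2 themselves timed out. A minimal diagnostic sketch, using only the Go standard library (this is not ovn-kubernetes or libovsdb code; the 5-second timeouts are arbitrary), that tells the two modes apart per endpoint:

```go
// Diagnostic sketch (stdlib only; not ovn-kubernetes/libovsdb code): for each
// NB endpoint, first try the DNS lookup, then a raw TCP dial to the nbdb port,
// so the two failure modes seen in the fatal log can be told apart.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Endpoint names copied verbatim from the inspect log above.
	base := "ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local"
	for _, pod := range []string{"ovnkube-master-0", "ovnkube-master-1", "ovnkube-master-2"} {
		host := pod + "." + base
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		addrs, err := net.DefaultResolver.LookupHost(ctx, host)
		cancel()
		if err != nil {
			// Matches "lookup ...: i/o timeout" (master-1, master-2).
			fmt.Printf("%s: DNS lookup failed: %v\n", host, err)
			continue
		}
		// DNS worked; check raw TCP reachability of the nbdb port 9641.
		conn, err := net.DialTimeout("tcp", net.JoinHostPort(addrs[0], "9641"), 5*time.Second)
		if err != nil {
			// Matches "dial tcp 10.128.26.57:9641: i/o timeout" (master-0).
			fmt.Printf("%s (%s): TCP dial failed: %v\n", host, addrs[0], err)
			continue
		}
		conn.Close()
		fmt.Printf("%s (%s): reachable\n", host, addrs[0])
	}
}
```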
I thought this was a bug we had fixed in the past: https://bugzilla.redhat.com/show_bug.cgi?id=2103590 . I will follow up on the details and talk to Patryk.
ACTION items for the SDN team as follow-up:
- Figure out why the domains couldn't be resolved; according to the fix https://github.com/openshift/cluster-network-operator/pull/1503/files, DNS should start to work once the pod gets its status.PodIP set.
- Figure out, using the provided must-gather and inspect logs, why restarts fixed the failing resolution/connections {does nbdb give up retrying after a certain point? see the illustrative sketch below}
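On that second question, a purely hypothetical illustration {we have NOT confirmed that nbdb/ovsdb-server's reconnect logic behaves this way}: if a client retries resolution with backoff but caps the number of attempts, it stays broken even after DNS recovers, and only a process restart makes it resolve again, which would match the restart behavior we observed on the call:

```go
// Hypothetical illustration only {we have NOT confirmed that nbdb /
// ovsdb-server behaves this way}: a client that retries resolution with
// backoff but gives up after maxAttempts stays broken even once DNS
// recovers, and only a process restart makes it resolve again.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func resolveWithRetry(host string, maxAttempts int) ([]string, error) {
	backoff := time.Second
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		addrs, err := net.DefaultResolver.LookupHost(ctx, host)
		cancel()
		if err == nil {
			return addrs, nil
		}
		lastErr = err
		fmt.Printf("attempt %d/%d failed: %v\n", attempt, maxAttempts, err)
		time.Sleep(backoff)
		backoff *= 2 // exponential backoff between attempts
	}
	// Past this point the caller never retries: the analogue of a daemon
	// that would need a restart to recover after DNS comes back.
	return nil, fmt.Errorf("giving up on %s after %d attempts: %w", host, maxAttempts, lastErr)
}

func main() {
	// Placeholder hostname; the real names are in the logs above.
	if _, err := resolveWithRetry("ovnkube-master-1.example.svc.cluster.local", 5); err != nil {
		fmt.Println(err)
	}
}
```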
ACTION items for SD team:
- they said they feel confident they can reproduce this; a recipe or a reproducer would be great, as we could then reproduce it in a similar manner on our side
- they will open an eng bug for this.
- is related to: HOSTEDCP-961 Create e2e to test management cluster node upgrade while running a hosted cluster (To Do)