OpenShift Bugs / OCPBUGS-11825

ovnkube-masters crashlooping while unable to communicate with nbdb


    • Critical
    • No
    • SDN Sprint 235, SDN Sprint 236, SDN Sprint 237, SDN Sprint 238, SDN Sprint 239
    • 5
    • Rejected
    • False
    • Customer Escalated

       (Copied from https://issues.redhat.com/browse/RHOCPPRIO-144?focusedId=22100103&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22100103)

      We were on a call with mshen.openshift and others, and what we saw is that:

      • the management cluster was upgrading to 4.12.10 
      • the worker nodes were getting drained and rebooted
      • the ovnkube-master pods that were on these worker nodes (which manage another guest cluster) entered a CLBO state
      • they provided an inspect of the pods, and in the inspect logs I saw:
      2023-04-13T19:00:43.627758417Z F0413 19:00:43.627737       1 ovnkube.go:133] error when trying to initialize libovsdb NB client: unable to connect to any endpoints: failed to connect to ssl:ovnkube-master-0.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local:9641: failed to open connection: dial tcp 10.128.26.57:9641: i/o timeout. failed to connect to ssl:ovnkube-master-1.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local:9641: failed to open connection: dial tcp: lookup ovnkube-master-1.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local: i/o timeout. failed to connect to ssl:ovnkube-master-2.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local:9641: failed to open connection: dial tcp: lookup ovnkube-master-2.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local: i/o timeout 
      • master was hence unable to talk to nbdb
      • looking at nbdb logs we saw the following errors
      2023-04-13T16:26:47.859231099Z 2023-04-13T16:26:47.859Z|146139|dns_resolve|WARN|Dropped 5 log messages in last 56 seconds (most recently, 12 seconds ago) due to excessive rate
      2023-04-13T16:26:47.859269067Z 2023-04-13T16:26:47.859Z|146140|dns_resolve|WARN|ovnkube-master-2.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local: failed to resolve
      2023-04-13T16:26:47.859295660Z 2023-04-13T16:26:47.859Z|146141|stream_ssl|ERR|ssl:ovnkube-master-2.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local:9643: connect: Address family not supported by protocol
      • for some reason nbdb was not able to resolve the IPs of those hostnames (see the probe sketch after this list)
      • initially, when I joined the call, master-0 was healthy; master-1 and master-2 were in CLBO. We restarted master-1 and it started talking to master-2 but not master-0; then we restarted master-2, which started talking to master-1 but not master-0; finally we had to restart master-0, and then all could talk to each other. {NOTE that initially the master-1 and master-2 pods were on nodes that had already been drained and rebooted; master-0's node had not yet finished upgrading}.
      • Masters got healthy; the incident is resolved as of now.
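
      To make the two failure modes above concrete, here is a minimal Go probe (a sketch written for this bug, not taken from the incident tooling; the endpoint names are copied verbatim from the ovnkube-master error above) that separates a DNS lookup failure from a dial failure against an already-resolved address:

        package main

        import (
            "context"
            "fmt"
            "net"
            "time"
        )

        func main() {
            // Endpoints copied from the ovnkube-master error above; 9641 is the OVN NB DB port.
            endpoints := []string{
                "ovnkube-master-0.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local:9641",
                "ovnkube-master-1.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local:9641",
                "ovnkube-master-2.ovnkube-master-internal.ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1.svc.cluster.local:9641",
            }
            for _, ep := range endpoints {
                host, _, err := net.SplitHostPort(ep)
                if err != nil {
                    fmt.Printf("%s: bad endpoint: %v\n", ep, err)
                    continue
                }
                // Step 1: DNS resolution. master-1/master-2 failed here ("lookup ...: i/o timeout").
                ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
                addrs, err := net.DefaultResolver.LookupHost(ctx, host)
                cancel()
                if err != nil {
                    fmt.Printf("%s: DNS failed: %v\n", host, err)
                    continue
                }
                // Step 2: TCP dial. master-0 resolved fine but failed here
                // ("dial tcp 10.128.26.57:9641: i/o timeout").
                conn, err := net.DialTimeout("tcp", ep, 5*time.Second)
                if err != nil {
                    fmt.Printf("%s (%v): dial failed: %v\n", host, addrs, err)
                    continue
                }
                conn.Close()
                fmt.Printf("%s (%v): reachable\n", host, addrs)
            }
        }

      Run from the affected pod's network namespace, a failure at step 1 matches the "lookup ...: i/o timeout" errors seen for master-1 and master-2, while a failure at step 2 matches the "dial tcp 10.128.26.57:9641: i/o timeout" seen for master-0's already-resolved address.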

      I thought this was a bug we had fixed in the past: https://bugzilla.redhat.com/show_bug.cgi?id=2103590 . I will follow up on the details and talk to Patryk.

      ACTION items for the SDN team as follow-up:

      • Figure out why the hostnames couldn't get resolved; according to the fix https://github.com/openshift/cluster-network-operator/pull/1503/files, DNS should start to work once the pod gets its status.PodIP set (a quick check for that precondition is sketched after this list).
      • Figure out, using the provided must-gather and inspect logs, why restarts fixed the failing connections {does nbdb give up retrying after a certain point?}.
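
      On the first action item, a rough way to check the precondition that fix relies on, sketched with client-go (the namespace and pod name below are inferred from the per-pod hostnames in the logs, <pod>.<service>.<namespace>.svc.cluster.local, not confirmed from the must-gather), is to poll the pod until status.PodIP is populated:

        package main

        import (
            "context"
            "fmt"
            "time"

            metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
            "k8s.io/client-go/kubernetes"
            "k8s.io/client-go/rest"
        )

        func main() {
            cfg, err := rest.InClusterConfig() // assumes this runs inside the management cluster
            if err != nil {
                panic(err)
            }
            cs, err := kubernetes.NewForConfig(cfg)
            if err != nil {
                panic(err)
            }
            // Inferred from the hostnames in the logs above; illustrative only.
            ns := "ocm-production-22npqomm54qf7s6pnovl2rsv4fajjicg-lh-hypershift1"
            name := "ovnkube-master-0"
            for {
                pod, err := cs.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
                if err != nil {
                    fmt.Printf("get %s/%s: %v\n", ns, name, err)
                } else if pod.Status.PodIP != "" {
                    // Per the CNO fix referenced above, per-pod DNS is only expected
                    // to resolve once this field is set.
                    fmt.Printf("%s/%s has PodIP %s\n", ns, name, pod.Status.PodIP)
                    return
                }
                time.Sleep(2 * time.Second)
            }
        }

      If DNS still fails after status.PodIP is set, as it apparently did here, then the problem is downstream of the condition that PR addresses.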

      ACTION items for SD team:

      • they said they feel confident they can reproduce this; if we can get a recipe or a reproducer, that would be great, and we can then reproduce it in a similar manner on our side
      • they will open an eng bug for this.

              npinaeva@redhat.com Nadia Pinaeva
              mshen.openshift Michael Shen
              Anurag Saxena