OpenShift Bugs / OCPBUGS-4618

[4.11] OVN silently failing in case of a stuck pod

      Description of problem:

      When a pod becomes stuck in Terminating and the node it belonged to no longer exists, ovnkube-master becomes stuck in a loop.

       

      Exactly how the pod ended up in this state is unknown, but a plausible sequence is:

      • A pod was given a finalizer of 'test'
      • The pod was deleted but the finalizer 'test' was not removed
      • The node the pod was running on was cordoned
      • The node was deleted via machine/node health checks
      • The pod is now stuck in Terminating and the node does not exist. (This is the state the cluster was in)
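The state described in the list above can be expressed as a simple predicate. This is a minimal sketch for illustration only: the `Pod` struct, its fields, and `isOrphanedTerminating` are hypothetical simplifications, not the real Kubernetes API types or anything taken from ovn-kubernetes.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for the Kubernetes Pod
// object; it is not the real API type.
type Pod struct {
	Name       string
	NodeName   string
	Deleting   bool     // true once a deletion timestamp has been set
	Finalizers []string // deletion is blocked while this is non-empty
}

// isOrphanedTerminating reports whether a pod matches the state described
// above: marked for deletion, held back by a finalizer, and assigned to a
// node that no longer exists.
func isOrphanedTerminating(p Pod, nodes map[string]bool) bool {
	return p.Deleting && len(p.Finalizers) > 0 && !nodes[p.NodeName]
}

func main() {
	// Only worker-1 still exists; worker-0 was removed by a health check.
	nodes := map[string]bool{"worker-1": true}
	pods := []Pod{
		{Name: "healthy", NodeName: "worker-1"},
		{Name: "stuck", NodeName: "worker-0", Deleting: true, Finalizers: []string{"test"}},
	}
	for _, p := range pods {
		fmt.Printf("%s orphaned=%v\n", p.Name, isOrphanedTerminating(p, nodes))
	}
}
```

A check of this kind could help locate such pods on an affected cluster so that the leftover finalizer can be removed by hand.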

      OVN then iterates over all pods to configure them. However, the error handling in the for loop returns on the first error instead of processing the remaining pods and collecting all the errors.
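The difference between the two loop behaviours can be sketched as follows. This is not the ovn-kubernetes code: `configure`, `configureFailFast`, and `configureCollect` are hypothetical names, pods are reduced to plain strings, and the standard library's `errors.Join` (Go 1.20 or later) stands in for whatever error aggregation the project would actually use.

```go
package main

import (
	"errors"
	"fmt"
)

// configure is a hypothetical stand-in for per-pod setup. It fails for
// the pod named "stuck" to imitate the pod whose node no longer exists.
func configure(pod string) error {
	if pod == "stuck" {
		return fmt.Errorf("pod %s: node not found", pod)
	}
	return nil
}

// configureFailFast returns on the first error, so every pod after the
// failing one is skipped.
func configureFailFast(pods []string) ([]string, error) {
	var done []string
	for _, p := range pods {
		if err := configure(p); err != nil {
			return done, err
		}
		done = append(done, p)
	}
	return done, nil
}

// configureCollect records each error, keeps processing the remaining
// pods, and reports all errors together once the loop has finished.
func configureCollect(pods []string) ([]string, error) {
	var done []string
	var errs []error
	for _, p := range pods {
		if err := configure(p); err != nil {
			errs = append(errs, err)
			continue
		}
		done = append(done, p)
	}
	// errors.Join returns nil when no errors were collected.
	return done, errors.Join(errs...)
}

func main() {
	pods := []string{"a", "stuck", "b"}

	done, err := configureFailFast(pods)
	fmt.Println("fail-fast:", done, err)

	done, err = configureCollect(pods)
	fmt.Println("collect:  ", done, err)
}
```

With the fail-fast loop, pod "b" is never configured; with the collecting loop, it is configured and the error for "stuck" is still reported to the caller.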

       

      As a result, a single stuck pod prevented the entire OpenShift control plane from coming up.

      https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/pods.go#L44

       

      To summarise, line 44 should not return an error until the end of the for loop.

       

      However, Tim Rozet has indicated that some refactoring may also make this code block redundant.

              ffernand@redhat.com Flavio Fernandes (Inactive)
              iwatson@redhat.com Ian Watson (Inactive)
              Arti Sood Arti Sood
              Tim Rozet