OpenShift Bugs / OCPBUGS-34890

Pods stuck terminating prevent ovn rollout


      Description of problem:

      Terminating pods are preventing OVN from rolling out during an upgrade. We've observed this on two separate clusters owned by separate customers.

      Version-Release number of selected component (if applicable):

      Both observations of this bug occurred while upgrading to version 4.14.26.

      The cluster where we saw this happen yesterday is currently hibernating, so we cannot retrieve the version it upgraded from; I will put the cluster ID in a comment in case that information can be obtained elsewhere.

      The cluster we saw this happen on today was upgrading from version 4.13.41.

      In both cases, this was during an upgrade and caused the upgrade to stop.

      How reproducible:

      Unknown

      Steps to Reproduce:

      Unknown

      Actual results:

      The cluster fails to roll out the network ClusterOperator because the ovnkube-node DaemonSet pods are stuck in Pending with the error message shown below.
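
      For reference, the rollout state can be checked with commands along these lines (the openshift-ovn-kubernetes namespace and the app=ovnkube-node label are the OVN-Kubernetes defaults, assumed here rather than taken from the attached data):

      oc get clusteroperator network
      oc -n openshift-ovn-kubernetes get daemonset ovnkube-node
      oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-node -o wide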

      Expected results:

      The terminating pods do not prevent ovnkube-node from rolling out.

      Additional info:

      The ovnkube-node DaemonSet pods that are failing to deploy show the following event under Events in oc describe pod:

      Warning  FailedScheduling  4m22s                 default-scheduler  0/13 nodes are available: 1 Insufficient memory. preemption: not eligible due to a terminating pod on the nominated node..
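
      To find which pods are blocking a given nominated node, something along these lines works (the node name here is a placeholder):

      oc get pods -A --field-selector spec.nodeName=<nominated-node> | grep Terminating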
      

      Since the stuck pods showed 0/N containers, we force-terminated them; that allowed the ovnkube-node DaemonSet pods to schedule, which in turn allowed the upgrade to start progressing again.
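
      For reference, force-terminating a pod stuck in Terminating can be done along these lines (pod and namespace names are placeholders):

      oc delete pod <stuck-pod> -n <namespace> --grace-period=0 --force

      Note that --force with --grace-period=0 removes the pod object from the API without waiting for kubelet confirmation, so it should only be used once the containers are known to be stopped.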

      Also worth noting: the pods that were stuck terminating today were not the same pods that were stuck terminating on the other cluster yesterday; there appears to be no correlation between the two workloads aside from hitting the same error.

      I will be attaching a must-gather from today's cluster and an sosreport from one of the nodes on yesterday's cluster where the pods were failing to schedule.

      I will also add that I'm not sure whether this is specifically an OVN issue or some other platform issue that happened to present as an OVN issue, since the pods that were failing to roll out in both cases were OVN pods. Please feel free to move this to the appropriate component as necessary.

      Affected Platforms:

      ROSA, but I believe this to be a core OCP issue.

      Is it an

      1. customer issue / SD

      If it is a customer / SD issue:

      • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
      • Don’t presume that Engineering has access to Salesforce.
      • Do presume that Engineering will access attachments through supportshell.
      • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
      • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
        • If the issue is in a customer namespace then provide a namespace inspect.
        • If it is a connectivity issue:
          • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
          • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
          • What is the traffic path? (examples: pod2pod, pod2external, pod2svc, pod2Node, etc.)
          • Please provide the UTC timestamp of the networking outage window from the must-gather
          • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
        • If it is not a connectivity issue:
          • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure, etc.) and the actual component where the issue was seen, based on the attached must-gather. Please attach snippets of relevant logs from around the window when the problem happened, if any.
      • When showing the results from commands, include the entire command in the output.  
      • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
      • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
      • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
      • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
