OCPBUGS-33059: Cluster upgrade stuck: openshift-ovn-kubernetes Pod unable to start due to Pod stuck Terminating on same node


      Description of problem:

      On an Azure Red Hat OpenShift (ARO) cluster that was in the middle of upgrading from 4.13.40 to 4.14.21, the upgrade became stuck on the network cluster operator (CO).
      
      Upon investigation, I noticed that:
      
      - A particular VM's openshift-ovn-kubernetes "ovnkube-node-xxxxx" Pod was unable to start due to the following error from the kube-scheduler shown in the Pod's Events:
      
      Warning  FailedScheduling  102m  default-scheduler  0/7 nodes are available: 1 Insufficient memory. preemption: not eligible due to a terminating pod on the nominated node..
      
      - Sure enough, there was a Pod on the node stuck Terminating, with this warning in its Events:
      
      Warning  FailedKillPod   89s (x140 over 106m)  kubelet            error killing pod: failed to "KillPodSandbox" for "REDACTED" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox REDACTED: error removing pod REDACTED from CNI network \"multus-cni-network\": plugin type=\"multus\" name=\"multus-cni-network\" failed (delete): Multus: [REDACTED]: PollImmediate error waiting for ReadinessIndicatorFile (on del): timed out waiting for the condition"
      
        This was a customer Pod in a customer Namespace, which is why some data is redacted from the message as I've typed it here.
      
      - The same node's openshift-multus "multus-xxxxx" Pod was also unable to start, apparently because the openshift-ovn-kubernetes Pod was not ready. (Example oc commands for confirming this state are sketched after this list.)
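
      For reference, the state described above can be checked with standard oc commands along these lines. This is only a rough sketch; <node-name>, <customer-namespace>, and <stuck-pod> are placeholders, not values from the affected cluster:

        # Confirm the network CO is the one stuck
        oc get co network

        # Find the affected node's ovnkube-node Pod and its FailedScheduling event
        oc -n openshift-ovn-kubernetes get pods -o wide | grep <node-name>
        oc -n openshift-ovn-kubernetes describe pod ovnkube-node-xxxxx

        # Find the Pod stuck Terminating on the same node and its FailedKillPod event
        oc get pods -A --field-selector spec.nodeName=<node-name> | grep Terminating
        oc -n <customer-namespace> describe pod <stuck-pod>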
      
      ---
      
      Given my very limited knowledge of the network CO, this seemed like it could have been the result of a race condition. As a workaround, I redeployed the affected node's underlying VM; after that, all network-related Pods started up and the upgrade resumed progressing (a rough sketch of the workaround follows).
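
      In case it is useful to anyone hitting the same issue, here is an approximation of that workaround, assuming Azure CLI access to the cluster resource group; <node-name>, <cluster-resource-group>, and <vm-name> are placeholders, and I am not certain the drain completes cleanly given the Pod stuck Terminating:

        # Cordon (and, if possible, drain) the affected node
        oc adm cordon <node-name>
        oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data --force

        # Redeploy the node's underlying Azure VM, then bring the node back into service
        az vm redeploy --resource-group <cluster-resource-group> --name <vm-name>
        oc adm uncordon <node-name>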

      Version-Release number of selected component (if applicable):

      Unsure

      How reproducible:

      Unsure

      Steps to Reproduce:

          1. Upgrade an ARO cluster from 4.13 to 4.14 (a sketch of the upgrade commands follows these steps), and wait until the upgrade progresses to the network CO.
          2. If the upgrade gets stuck on the network CO, check whether any particular node's openshift-ovn-kubernetes "ovnkube-node-xxxxx" Pod is unable to start while another Pod is stuck Terminating on the same node, with the same Event messages given in the problem description.
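
      For step 1, a minimal sketch of driving and watching the upgrade from the CLI (on ARO the upgrade may instead be driven through the managed upgrade tooling):

        oc adm upgrade --to=4.14.21    # or whichever 4.14.z is available on the cluster's channel
        oc get clusterversion -w       # watch overall upgrade progress
        oc get co network -w           # watch for the network CO getting stuck

      For step 2, the same oc commands sketched under the problem description can be used to check for the stuck ovnkube-node Pod and the Pod stuck Terminating.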
          

      Actual results:

      The cluster upgrade was stuck on the network CO and unable to progress until I redeployed the affected node's underlying VM.

      Expected results:

      The cluster upgrade proceeds without getting stuck on the network CO.

      Additional info:

      N/A
