OpenShift Bugs / OCPBUGS-14827

All worker nodes unable to successfully create pods


      Description of problem:

      Containers fail to start and end up with this status:
      
          Last State:     Terminated
            Reason:       Error
            Exit Code:    139
            Started:      Fri, 09 Jun 2023 10:01:37 -0400
            Finished:     Fri, 09 Jun 2023 10:01:37 -0400
      
      and exit without showing any logs.
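
      For reference, exit code 139 normally corresponds to 128 + 11, i.e. the container process received SIGSEGV before it could write any output, which matches the empty logs. The terminated state and previous-container logs can be pulled with something like the following (pod and namespace names are placeholders):
      
      oc -n <namespace> get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated}'
      oc -n <namespace> describe pod <pod-name>
      oc -n <namespace> logs <pod-name> --previous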
      
      For example, restarting a previously "Running" pod puts it into this state:
      
      ❯ k get po -n hypershift   
      NAME                            READY   STATUS             RESTARTS         AGE
      external-dns-7cc4b775d9-t558s   1/1     Running            0                2d20h
      operator-58f644b4bb-2rbt4       0/1     CrashLoopBackOff   17 (4m53s ago)   66m
      operator-58f644b4bb-w4mkf       1/1     Running            0                2d16h
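
      One way to trigger the restart described above is simply deleting the running pod and letting its Deployment recreate it (pod name taken from the listing above; the exact restart method may differ):
      
      oc -n hypershift delete pod operator-58f644b4bb-2rbt4
      oc -n hypershift get pods -o wide -w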

      Version-Release number of selected component (if applicable):

      4.12.12

      How reproducible:

      Unsure

      Steps to Reproduce:

      Unsure
      

      Actual results:

      What we ended up doing on this cluster was replacing the worker machines, which created new nodes. After forcing pods to reschedule onto the new nodes, all pods were able to start successfully.
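
      For context, a replacement like this typically goes through the Machine API, roughly as follows; node and machine names are placeholders, and the exact commands used on this cluster were not captured:
      
      oc -n openshift-machine-api get machines
      oc adm cordon <affected-node>
      oc adm drain <affected-node> --ignore-daemonsets --delete-emptydir-data
      # deleting the machine lets its MachineSet create a replacement node
      oc -n openshift-machine-api delete machine <affected-machine>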
      
      This behavior was confusingly consistent for some pods, but not all. For example, I was able to run
      
      kubectl run ubuntu --image ubuntu --rm -it
      
      and that worked just fine. Deleting and recreating an ovnkube-node pod scheduled on the same worker node also succeeded, as did "oc debug"ing onto an affected worker node and running
      
      podman run --rm -it --entrypoint=bash ${CONTAINER_IMAGE}
      
      for the images of pods that would not start up.
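
      For live investigation on an affected (cordoned) node, something like the following can be used to look at the CRI-O side of the failure; node name and container ID are placeholders:
      
      oc debug node/<affected-node>
      # inside the debug shell:
      chroot /host
      crictl ps -a
      crictl inspect <container-id>
      crictl logs <container-id>
      journalctl -u crio --since "1 hour ago"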

      Expected results:

      On one hand, that pods are able to start successfully (we do not understand what caused this bug).
      
      On the other hand, that the new and old nodes have the same configuration: it is strange that the new replacement machines are viable while the existing ones were not.

      Additional info:

      Must gather link: https://drive.google.com/file/d/10m7TpJEdmBbLec35PD9vpHqoIV8bYzW-/view?usp=sharing
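
      For anyone pulling fresh data from the affected clusters, a must-gather can be collected with, for example:
      
      oc adm must-gather --dest-dir=./must-gather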

       

      We have live clusters where this bug is still occurring and have cordoned nodes for live investigation if needed. Please feel free to reach out if you would like to; we can screenshare!
