OCPBUGS-16283

Handle multiple CreateContainerError pods on Worker Nodes gracefully


      Description of problem:

      While diagnosing errors in several HyperShift clusters, we noticed that every affected Hosted Cluster had components running on the same node, and all of those pods were stuck in `CreateContainerError`.
      

      Version-Release number of selected component (if applicable):

      
      

      How reproducible:

      
      

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

      $ for ns in $(oc get ns | grep -E 'namespace-.*-.*' | grep -v Terminating | awk '{print $1 }'); do echo "$ns"; oc get pods -n $ns -o wide | grep "Error"; echo "---"; done | grep -C1 Error
      namespace-abcdefg-9jadj3
      router-6465c96c76-5k7xn                                  0/1     CreateContainerError   1 (161m ago)    25d     10.128.62.166   ip-10-0-1-50.ec2.internal    <none>           <none>
      ---
      --
      namespace-abcdefg-e92dk
      kube-controller-manager-abcdefg-8x5qq                 1/2     CreateContainerError   0 (151m ago)   10d     10.128.63.155   ip-10-0-1-50.ec2.internal    <none>           <none>
      router-abcdefg-xrtzg                                  0/1     CreateContainerError   1 (161m ago)   15d     10.128.63.20    ip-10-0-1-50.ec2.internal    <none>           <none>
      ---
      --
      namespace-abcdefg-j389s
      router-abcdefg-l6q8c                                  0/1     CreateContainerError   1 (161m ago)   17d     10.128.62.156   ip-10-0-1-50.ec2.internal    <none>           <none>
      ---
      --
      namespace-abcdefg-ej82j
      hosted-cluster-config-operator-abcdefg-bwdb8          0/1     CreateContainerError   0 (160m ago)   6d6h    10.128.63.73    ip-10-0-1-50.ec2.internal    <none>           <none>
      router-57c69765b5-mrdbc                                  0/1     CreateContainerError   0 (161m ago)   6d6h    10.128.63.50    ip-10-0-1-50.ec2.internal    <none>           <none>
      ---
      --
      namespace-abcdefg-cki7h
      ovnkube-master-0                                         6/7     CreateContainerError   0 (161m ago)   5h26m   10.128.62.135   ip-10-0-1-50.ec2.internal    <none>           <none>
      router-67cdc987fc-4zztr                                  0/1     CreateContainerError   0 (161m ago)   5h27m   10.128.62.95    ip-10-0-1-50.ec2.internal    <none>           <none>
      ---
      --
      namespace-abcdefg-iwepv
      openshift-oauth-apiserver-abcdefg-7pc99               1/2     CreateContainerError   0 (150m ago)   3h58m   10.128.62.160   ip-10-0-1-50.ec2.internal    <none>           <none>
      router-7ff9ff7ddb-vfhk5                                  0/1     CreateContainerError   0 (161m ago)   3h58m   10.128.62.157   ip-10-0-1-50.ec2.internal    <none>           <none>
      ---
      --
      namespace-abcdefg-jzyt3
      openshift-oauth-apiserver-abcdefg-f69zm                1/2     CreateContainerError   0 (150m ago)   3h14m   10.128.62.202   ip-10-0-1-50.ec2.internal    <none>           <none>
      router-74dc876ccf-chhzs                                  0/1     CreateContainerError   0 (161m ago)   3h15m   10.128.62.194   ip-10-0-1-50.ec2.internal    <none>           <none>
      ---
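
The per-namespace loop above makes the shared node visible only by eye. A minimal sketch of a per-node tally follows; the helper name `count_cce_by_node` is hypothetical, and the sketch assumes `oc get pods -o custom-columns` emits one line per pod with the node name first and a comma-separated list of container waiting reasons second.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count pods per node that have at least one container
# waiting with reason CreateContainerError.
# Input (stdin): one line per pod, "<node> <comma-separated waiting reasons>".
# Output: "<count> <node>" lines, highest count first.
count_cce_by_node() {
  awk '$2 ~ /(^|,)CreateContainerError(,|$)/ { n[$1]++ }
       END { for (k in n) print n[k], k }' | sort -rn
}

# Assumed usage against a live cluster (not run here):
#   oc get pods -A --no-headers \
#     -o custom-columns='NODE:.spec.nodeName,REASON:.status.containerStatuses[*].state.waiting.reason' \
#     | count_cce_by_node
```

Reading structured columns instead of the `-o wide` table avoids the RESTARTS field, whose `1 (161m ago)` form contains spaces and shifts column positions.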
      

      Expected results:

      I would expect the node to be recycled once enough pods on it hit `CreateContainerError`. This may have been a transient error, but it also seemed to prevent these pods' controllers from recreating them properly, leaving the resources dangling.
      

      Additional info:

      Fixing the problem was simple once we identified it: after tracing all of the failures to the single bad node, we ran `oc delete machine -n openshift-machine-api [machine-name]` on the machine backing that node and let the machine controller replace it. However, manual intervention is not ideal. I'd like guidance on how to prevent this upstream, either by replacing a node once enough of its pods are stuck in `CreateContainerError`, _or_ by figuring out why the pods' controllers did not retry creating the pods themselves even though the deployments for those pods did not match their spec.
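
For reference, the manual workaround can be sketched as a small script. The function names `machine_for_node` and `replace_node` and the `DRY_RUN` variable are hypothetical. The sketch assumes the node carries the `machine.openshift.io/machine` annotation in `<namespace>/<machine-name>` form, and by default it only prints the delete command instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: look up the Machine reference recorded on a node.
# Assumes the annotation value has the form "<namespace>/<machine-name>".
machine_for_node() {
  oc get node "$1" \
    -o jsonpath='{.metadata.annotations.machine\.openshift\.io/machine}'
}

# Hypothetical helper: print (DRY_RUN=1, the default) or run the delete
# command for the Machine backing the given node.
replace_node() {
  local node="$1" ref ns name
  ref="$(machine_for_node "$node")"
  if [ -z "$ref" ]; then
    echo "no machine annotation found on node $node" >&2
    return 1
  fi
  ns="${ref%%/*}"    # part before the first slash
  name="${ref#*/}"   # part after the first slash
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "oc delete machine -n $ns $name"
  else
    oc delete machine -n "$ns" "$name"
  fi
}
```

Running `replace_node ip-10-0-1-50.ec2.internal` would print the command to review, and setting `DRY_RUN=0` would execute it. This only automates the manual step and does not address why the node got into this state.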
      

              pehunt@redhat.com Peter Hunt
              iamkirkbater Kirk Bater
              Sunil Choudhary Sunil Choudhary