-
Bug
-
Resolution: Cannot Reproduce
-
Normal
-
None
-
4.13.z
-
No
-
False
-
Description of problem:
When diagnosing a few errors with some HyperShift clusters, we noticed that all of the affected Hosted Clusters had components running on a single node, all stuck in `CreateContainerError`.
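For anyone triaging something similar, a quick way to confirm that the failing pods all land on one node is to count CreateContainerError pods per node (a rough sketch, not the exact command we used):

$ oc get pods -A -o custom-columns='NODE:.spec.nodeName,REASON:.status.containerStatuses[*].state.waiting.reason' \
    | grep CreateContainerError | awk '{print $1}' | sort | uniq -c | sort -rn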
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
$ for ns in $(oc get ns | grep -E 'namespace-.*-.*' | grep -v Terminating | awk '{print $1 }'); do echo "$ns"; oc get pods -n $ns -o wide | grep "Error"; echo "---"; done | grep -C1 Error
namespace-abcdefg-9jadj3
router-6465c96c76-5k7xn   0/1   CreateContainerError   1 (161m ago)   25d   10.128.62.166   ip-10-0-1-50.ec2.internal   <none>   <none>
---
--
namespace-abcdefg-e92dk
kube-controller-manager-abcdefg-8x5qq   1/2   CreateContainerError   0 (151m ago)   10d   10.128.63.155   ip-10-0-1-50.ec2.internal   <none>   <none>
router-abcdefg-xrtzg                    0/1   CreateContainerError   1 (161m ago)   15d   10.128.63.20    ip-10-0-1-50.ec2.internal   <none>   <none>
---
--
namespace-abcdefg-j389s
router-abcdefg-l6q8c   0/1   CreateContainerError   1 (161m ago)   17d   10.128.62.156   ip-10-0-1-50.ec2.internal   <none>   <none>
---
--
namespace-abcdefg-ej82j
hosted-cluster-config-operator-abcdefg-bwdb8   0/1   CreateContainerError   0 (160m ago)   6d6h   10.128.63.73   ip-10-0-1-50.ec2.internal   <none>   <none>
router-57c69765b5-mrdbc                        0/1   CreateContainerError   0 (161m ago)   6d6h   10.128.63.50   ip-10-0-1-50.ec2.internal   <none>   <none>
---
--
namespace-abcdefg-cki7h
ovnkube-master-0          6/7   CreateContainerError   0 (161m ago)   5h26m   10.128.62.135   ip-10-0-1-50.ec2.internal   <none>   <none>
router-67cdc987fc-4zztr   0/1   CreateContainerError   0 (161m ago)   5h27m   10.128.62.95    ip-10-0-1-50.ec2.internal   <none>   <none>
---
--
namespace-abcdefg-iwepv
openshift-oauth-apiserver-abcdefg-7pc99   1/2   CreateContainerError   0 (150m ago)   3h58m   10.128.62.160   ip-10-0-1-50.ec2.internal   <none>   <none>
router-7ff9ff7ddb-vfhk5                   0/1   CreateContainerError   0 (161m ago)   3h58m   10.128.62.157   ip-10-0-1-50.ec2.internal   <none>   <none>
---
--
namespace-abcdefg-jzyt3
openshift-oauth-apiserver-abcdefg-f69zm   1/2   CreateContainerError   0 (150m ago)   3h14m   10.128.62.202   ip-10-0-1-50.ec2.internal   <none>   <none>
router-74dc876ccf-chhzs                   0/1   CreateContainerError   0 (161m ago)   3h15m   10.128.62.194   ip-10-0-1-50.ec2.internal   <none>   <none>
---
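For reference, commands along these lines should surface the underlying container-creation failure (a sketch, using the pod and node names from the output above):

$ oc describe pod router-6465c96c76-5k7xn -n namespace-abcdefg-9jadj3
  # the Events section carries the CreateContainerError message from the kubelet/CRI-O
$ oc adm node-logs ip-10-0-1-50.ec2.internal -u crio | tail -n 200
  # CRI-O journal on the suspect node, to see why container creation keeps failing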
Expected results:
I would expect that if enough CreateContainerError pods piled up on a node, the node would be recycled. This may have been a transient error, but it also seemed like this error prevented the pods' controllers from recreating them properly and left the resources dangling.
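One possible direction, though this is an assumption on my part and not something we have validated: a MachineHealthCheck only remediates on node conditions such as Ready going False/Unknown, so it would not catch CreateContainerError directly, but something along these lines would at least automate node replacement once the node itself degrades (the name below is hypothetical):

$ oc apply -f - <<'EOF'
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-not-ready-remediation   # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: "Unknown"
    timeout: 300s
  maxUnhealthy: 40%
EOF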
Additional info:
Fixing the problem was simple once the issue was identified: after tracing all of this to the single bad node, we ran `oc delete machine -n openshift-machine-api [machine-name]` on the backing machine for that node and allowed the machine controller to replace the node. However, having to intervene manually is not ideal, and I'd like guidance on how we can prevent this issue upstream, either by replacing a node once enough of its pods are stuck in a CreateContainerError state, or by figuring out why the pods' controllers didn't attempt to recreate the pods themselves even though they no longer matched their deployments' spec.
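For reference, the manual remediation amounted to the following (sketch; `<machine-name>` is a placeholder):

# The machine-api node-link controller annotates each node with its backing Machine
$ oc get node ip-10-0-1-50.ec2.internal -o yaml | grep 'machine.openshift.io/machine'
    machine.openshift.io/machine: openshift-machine-api/<machine-name>
# Deleting the Machine lets the MachineSet controller provision a replacement node
$ oc delete machine -n openshift-machine-api <machine-name>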