OCPBUGS-16283: Handle multiple CreateContainerError pods on Worker Nodes gracefully


      Description of problem:

      While diagnosing errors on a few HyperShift clusters, we noticed that every affected Hosted Cluster had components running on the same single node, all stuck in `CreateContainerError`.
      

      Version-Release number of selected component (if applicable):

      
      

      How reproducible:

      
      

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

      $ for ns in $(oc get ns | grep -E 'namespace-.*-.*' | grep -v Terminating | awk '{print $1 }'); do echo "$ns"; oc get pods -n $ns -o wide | grep "Error"; echo "---"; done | grep -C1 Error
      namespace-abcdefg-9jadj3
      router-6465c96c76-5k7xn                                  0/1     CreateContainerError   1 (161m ago)    25d     10.128.62.166   ip-10-0-1-50.ec2.internal    <none>           <none>
      ---
      --
      namespace-abcdefg-e92dk
      kube-controller-manager-abcdefg-8x5qq                 1/2     CreateContainerError   0 (151m ago)   10d     10.128.63.155   ip-10-0-1-50.ec2.internal    <none>           <none>
      router-abcdefg-xrtzg                                  0/1     CreateContainerError   1 (161m ago)   15d     10.128.63.20    ip-10-0-1-50.ec2.internal    <none>           <none>
      ---
      --
      namespace-abcdefg-j389s
      router-abcdefg-l6q8c                                  0/1     CreateContainerError   1 (161m ago)   17d     10.128.62.156   ip-10-0-1-50.ec2.internal    <none>           <none>
      ---
      --
      namespace-abcdefg-ej82j
      hosted-cluster-config-operator-abcdefg-bwdb8          0/1     CreateContainerError   0 (160m ago)   6d6h    10.128.63.73    ip-10-0-1-50.ec2.internal    <none>           <none>
      router-57c69765b5-mrdbc                                  0/1     CreateContainerError   0 (161m ago)   6d6h    10.128.63.50    ip-10-0-1-50.ec2.internal    <none>           <none>
      ---
      --
      namespace-abcdefg-cki7h
      ovnkube-master-0                                         6/7     CreateContainerError   0 (161m ago)   5h26m   10.128.62.135   ip-10-0-1-50.ec2.internal    <none>           <none>
      router-67cdc987fc-4zztr                                  0/1     CreateContainerError   0 (161m ago)   5h27m   10.128.62.95    ip-10-0-1-50.ec2.internal    <none>           <none>
      ---
      --
      namespace-abcdefg-iwepv
      openshift-oauth-apiserver-abcdefg-7pc99               1/2     CreateContainerError   0 (150m ago)   3h58m   10.128.62.160   ip-10-0-1-50.ec2.internal    <none>           <none>
      router-7ff9ff7ddb-vfhk5                                  0/1     CreateContainerError   0 (161m ago)   3h58m   10.128.62.157   ip-10-0-1-50.ec2.internal    <none>           <none>
      ---
      --
      namespace-abcdefg-jzyt3
      openshift-oauth-apiserver-abcdefg-f69zm                1/2     CreateContainerError   0 (150m ago)   3h14m   10.128.62.202   ip-10-0-1-50.ec2.internal    <none>           <none>
      router-74dc876ccf-chhzs                                  0/1     CreateContainerError   0 (161m ago)   3h15m   10.128.62.194   ip-10-0-1-50.ec2.internal    <none>           <none>
      ---
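
      Everything in the output above lands on the same node, ip-10-0-1-50.ec2.internal. As a minimal sketch (not part of the original triage; the jsonpath fields are standard pod fields, but the exact pipeline is an assumption), stuck containers can be aggregated per node to make that pattern obvious:

      # count pods with at least one container waiting in CreateContainerError, grouped by node
      $ oc get pods -A -o jsonpath='{range .items[*]}{.spec.nodeName}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' | grep -w CreateContainerError | cut -f1 | sort | uniq -c | sort -rn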
      

      Expected results:

      I would expect that if enough CreateContainerError pods piled up on a node, the node would be recycled. This may have been a transient error, but it also seemed to prevent these pods' controllers from attempting to recreate them properly, leaving the resources dangling.
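
      For reference, the node-recycling automation that exists today (MachineHealthCheck in the Machine API) keys off node conditions such as Ready, not pod or container states, so a configuration like the sketch below would not have caught this case; it is included only to illustrate the current shape of "recycle the node automatically" (the name, timeouts, and maxUnhealthy value are assumptions, applied with `oc apply -f <file>`):

      # Sketch of a MachineHealthCheck; node conditions only -- CreateContainerError pods alone never trip these.
      apiVersion: machine.openshift.io/v1beta1
      kind: MachineHealthCheck
      metadata:
        name: worker-auto-recycle          # hypothetical name
        namespace: openshift-machine-api
      spec:
        selector:
          matchLabels:
            machine.openshift.io/cluster-api-machine-role: worker
        unhealthyConditions:
        - type: Ready
          status: "False"
          timeout: "300s"
        - type: Ready
          status: Unknown
          timeout: "300s"
        maxUnhealthy: "40%"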
      

      Additional info:

      Fixing the problem was simple once we identified the issue: after tracing everything back to the single bad node, we ran `oc delete machine -n openshift-machine-api [machine-name]` on the backing machine for that node and let the machine controller replace it. However, having to intervene manually is not ideal. I'd like guidance on how to prevent this upstream, either by replacing a node once enough of its pods are stuck in the CreateContainerError state, or by figuring out why the pods' controllers didn't retry creating the pods themselves, given that the deployments for those pods no longer matched their spec. A rough sketch of the manual remediation is shown below.
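
      For illustration only (assuming the node is backed by a Machine API machine, as it was here; variable names are just for readability):

      $ NODE=ip-10-0-1-50.ec2.internal
      $ MACHINE=$(oc get machines -n openshift-machine-api -o wide | awk -v n="$NODE" '$0 ~ n {print $1}')    # machine whose NODE column matches
      $ oc delete machine -n openshift-machine-api "${MACHINE}"    # machine controller provisions a replacement node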
      

            Assignee: Peter Hunt (pehunt@redhat.com)
            Reporter: Kirk Bater (iamkirkbater)
            QA Contact: Sunil Choudhary