-
Bug
-
Resolution: Can't Do
-
Normal
-
None
-
4.18
-
Quality / Stability / Reliability
-
False
-
-
1
-
MCO Sprint 277
-
1
On an OCP 4.18 hub cluster, the component that runs rpm-ostree on HCP worker nodes does not retry when it encounters an error, so a single failed image pull leaves the cluster stuck in the middle of an upgrade. The customer uses a pull-through registry cache: the first image pull for the first node timed out and was never retried, which left the entire node pool stuck.
In addition, because the MCD pods in HyperShift are transient (unnecessarily so), it is impossible to capture pod logs. This makes the issue extremely difficult to diagnose, and very little is gained from this feature in return.
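For illustration only, the retry behavior requested here could look roughly like the sketch below. This is not MCD or rpm-ostree code. The function `pull_with_retry`, its parameters, and the choice to treat `TimeoutError` and `ConnectionError` as transient are all assumptions made for the example. The `pull` callable stands in for whatever performs the actual image pull.

```python
import time
from typing import Callable


def pull_with_retry(
    pull: Callable[[], None],
    attempts: int = 5,
    initial_delay: float = 1.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call pull() until it succeeds or the attempts are exhausted.

    Returns the number of attempts used. Re-raises the last transient
    error if every attempt fails. All names here are hypothetical.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            pull()
            return attempt
        except (TimeoutError, ConnectionError):
            # Assumption: timeouts and connection errors are transient.
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff  # exponential backoff between attempts
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_pull() -> None:
        # Simulates a registry cache that times out on the first pull only.
        calls["n"] += 1
        if calls["n"] == 1:
            raise TimeoutError("registry cache timed out")

    used = pull_with_retry(flaky_pull, sleep=lambda seconds: None)
    print(f"image pull succeeded after {used} attempts")
```

With a wrapper of this kind, a first-pull timeout against a cold pull-through cache would be retried instead of leaving the node pool stuck.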
During troubleshooting, it was noted that the `machineconfig` daemon `pod` is not always running, unlike in a regular cluster. Instead, when the desired `mcd` config differs from the current config, we deploy the `pod` onto the worker `node` to do the upgrade in place.
Please refer to this.
The inspect of the upgrade namespace from the hosted cluster is available here.
Please also refer to MCO-358, as it seems to be related to this request.