- Bug
- Resolution: Won't Do
- Major
- None
- 4.13.z
- No
- Rejected
- False
Description of problem:
We've seen multiple incidents where OpenShift Dedicated/ROSA clusters fail to provision because static pods fail to come online (or potentially fail after coming online), causing the ClusterOperator to become Degraded/Down and provisioning to never finish:

$ oc get co
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-scheduler   4.13.4    True        True          True       3h7m    MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "openshift-kube-scheduler" in namespace: "openshift-kube-scheduler" for revision: 5 on node: "ip-10-0-212-204.us-east-2.compute.internal" didn't show up, waited: 3m0s...

Attempting to grab an sosreport from the node failed, as did running oc adm inspect ns/openshift-kube-scheduler. We've _also_ seen this with openshift-kube-controller-manager, and these errors pop up on already-running clusters as well.

The solution is to reboot the control plane node, but that won't help during a managed cluster install: the install simply fails, and our current recourse is to ask the customer to retry. We will be implementing a workaround to have Hive retry these cluster installs, but I think the root cause of the issue should still be investigated.
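For reference, a minimal sketch of the triage we typically attempt when this shows up; the node and namespace names are taken from the example above and will differ per incident:

# Which ClusterOperators are Degraded, and why
oc get clusteroperators
oc get clusteroperator kube-scheduler -o yaml

# Is the static pod for the expected revision present on the affected node?
oc get pods -n openshift-kube-scheduler -o wide

# Namespace-level diagnostics (this is the step that has been failing for us)
oc adm inspect ns/openshift-kube-scheduler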
Version-Release number of selected component (if applicable):
We've seen this on 4.13.4 and 4.13.5 clusters since the issue became frequent enough for us to start collecting data on it. We will continue to update this JIRA as we find more versions where this is happening.
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
A static pod on a control plane node dies for unexpected reasons and the corresponding ClusterOperator becomes Degraded. During install this causes the install to fail, resulting in a poor customer experience. After install, it leaves the operator Degraded and requires SRE intervention to reboot the affected node.
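The manual remediation SRE applies today is roughly the following (sketch only; the node name is the one from the example above and is illustrative):

# Reboot the affected control plane node from a debug pod
oc debug node/ip-10-0-212-204.us-east-2.compute.internal -- chroot /host systemctl reboot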
Expected results:
Additional info:
I've opened this at Major priority due to the impact on customer installs and the resulting customer experience. I'm also not sure that "Installer" is the correct component here, since this is a result of static pods on a control plane node becoming degraded, and it can happen with, at a minimum, the kube-controller-manager and kube-scheduler workloads. I'm not sure where the best place to open this card is; I just want to get _something_ on the radar and have a common collection point for any data we can collect. We're more than happy to gather more data around this, especially if there's specific data we can collect to help diagnose the issue.
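As a starting point, this is roughly the node-level data we could capture the next time this reproduces, assuming the API server is still reachable; the node name is illustrative and the paths are the usual static pod locations on control plane nodes:

# Kubelet and CRI-O journals from the affected control plane node
oc adm node-logs ip-10-0-212-204.us-east-2.compute.internal -u kubelet
oc adm node-logs ip-10-0-212-204.us-east-2.compute.internal -u crio

# Confirm the static pod manifest for the expected revision actually landed on disk
oc debug node/ip-10-0-212-204.us-east-2.compute.internal -- chroot /host ls -l /etc/kubernetes/manifests /etc/kubernetes/static-pod-resources

# Full must-gather, if the cluster is healthy enough for it to complete
oc adm must-gather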
- causes HIVE-2281 Hive: retry installation on GeneralOperatorDegraded (To Do)