- Bug
- Resolution: Won't Do
- Major
- None
- 4.13.z
- No
- Rejected
- False
Description of problem:
We've seen multiple incidents where OpenShift Dedicated/ROSA clusters fail to provision because static pods fail to come online (or potentially fail after coming online), causing the ClusterOperator to become Degraded/Down and provisioning to never finish:

$ oc get co
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-scheduler   4.13.4    True        True          True       3h7m    MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "openshift-kube-scheduler" in namespace: "openshift-kube-scheduler" for revision: 5 on node: "ip-10-0-212-204.us-east-2.compute.internal" didn't show up, waited: 3m0s...

Attempting to grab an sosreport from the node failed, as did running oc adm inspect ns/openshift-kube-scheduler. We've _also_ seen this with openshift-kube-controller-manager, and these errors pop up on already-running clusters as well.

The solution is to reboot the control plane node, but that won't help during a managed cluster install: the install simply fails, and our current recourse is to ask the customer to retry. We will be implementing a workaround to have Hive retry these cluster installs, but I think the root cause of the issue should still be investigated.
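For reference, a minimal sketch of the triage we typically attempt when this shows up; the node and namespace names are taken from the example above and will differ per incident:

# Which ClusterOperators are Degraded, and why
oc get clusteroperators
oc get clusteroperator kube-scheduler -o yaml

# Is the static pod for the expected revision present on the affected node?
oc get pods -n openshift-kube-scheduler -o wide

# Namespace-level diagnostics (this is the step that has been failing for us)
oc adm inspect ns/openshift-kube-scheduler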
Version-Release number of selected component (if applicable):
We've seen this on 4.13.4 and 4.13.5 clusters since the issue became frequent enough for us to start collecting data on it. We will continue to update this JIRA as we find more versions where this is happening.
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
A static pod on a control plane node dies for unexpected reasons and the corresponding ClusterOperator becomes Degraded. During install this causes the install to fail, resulting in a poor customer experience. After install, it leaves the operator Degraded and requires SRE intervention to reboot the affected node.
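The manual remediation SRE applies today is roughly the following (sketch only; the node name is the one from the example above and is illustrative):

# Reboot the affected control plane node from a debug pod
oc debug node/ip-10-0-212-204.us-east-2.compute.internal -- chroot /host systemctl reboot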
Expected results:
Additional info:
I've opened this at Major priority due to the impact on customer installs and the resulting customer experience. I'm also not sure that "Installer" is the correct component here, since this is a result of static pods on a control plane node becoming degraded, and it can happen with, at a minimum, the kube-controller-manager and kube-scheduler workloads. I'm not sure where the best place to open this card is; I just want to get _something_ on the radar and have a common collection point for any data we can collect. We're more than happy to gather more data around this, especially if there's specific data we can collect to help diagnose the issue.
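As a starting point, this is roughly the node-level data we could capture the next time this reproduces, assuming the API server is still reachable; the node name is illustrative and the paths are the usual static pod locations on control plane nodes:

# Kubelet and CRI-O journals from the affected control plane node
oc adm node-logs ip-10-0-212-204.us-east-2.compute.internal -u kubelet
oc adm node-logs ip-10-0-212-204.us-east-2.compute.internal -u crio

# Confirm the static pod manifest for the expected revision actually landed on disk
oc debug node/ip-10-0-212-204.us-east-2.compute.internal -- chroot /host ls -l /etc/kubernetes/manifests /etc/kubernetes/static-pod-resources

# Full must-gather, if the cluster is healthy enough for it to complete
oc adm must-gather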
- causes HIVE-2281 Hive: retry installation on GeneralOperatorDegraded (To Do)