Type: Bug
Resolution: Duplicate
Version: 4.10
Impact: Quality / Stability / Reliability
Severity: Moderate
Description of problem:
We've come across a fairly rare bug during mass installations of SNO clusters that prevents an installation from completing on its own.
The symptoms are as follows:
- The mirror pod for kube-controller-manager is stuck with 3/4 containers ready
- The container that's down is kube-controller-manager
- Running `oc logs` for that particular container fails with this error:
>[root@e24-h01-000-r640 omer]# oc logs -n openshift-kube-controller-manager kube-controller-manager-sno01271 kube-controller-manager
>Error from server (BadRequest): container "kube-controller-manager" in pod "kube-controller-manager-sno01271" is not available
>[root@e24-h01-000-r640 omer]# oc logs -n openshift-kube-controller-manager kube-controller-manager-sno01271 kube-controller-manager --previous
>Error from server (BadRequest): container "kube-controller-manager" in pod "kube-controller-manager-sno01271" is not available
Noting the cri-o://... ID of this container from the mirror pod's status, then SSHing into the node and running sudo crictl logs ... with that ID (see the command sketch at the end of this description), we see the culprit:
>E0505 20:36:46.587883 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api/v1/namespaces/kube-system/configmaps/kube-controller-manager?timeout=5s": dial tcp [fc00:1001::8de]:6443: connect: connection refused
>E0505 20:36:47.526268 1 namespaced_resources_deleter.go:161] unable to get all supported resources from server: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api": dial tcp [fc00:1001::8de]:6443: connect: connection refused
>F0505 20:36:47.526298 1 namespaced_resources_deleter.go:164] Unable to get any supported resources from server: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api": dial tcp [fc00:1001::8de]:6443: connect: connection refused
>goroutine 413 [running]:
>k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0x1)
>...
Here fc00:1001::8de is the IPv6 address of the single node (it's an IPv6 OVNKubernetes offline cluster, though I don't think that's a crucial detail). The fatal log comes from these lines [1].
The connection-refused part is not very surprising: since this is SNO, the API server fluctuates between being up and down, especially during the early stages of the installation.
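For convenience, here is a rough sketch of the commands used above; the pod/node names are the ones from this cluster, the jsonpath assumes the usual containerStatuses layout, and the node is assumed to be reachable as the core user:

# Grab the cri-o://... ID of the kube-controller-manager container from the mirror pod's status
oc get pod -n openshift-kube-controller-manager kube-controller-manager-sno01271 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="kube-controller-manager")].containerID}'

# SSH to the node and read the container logs straight from cri-o (drop the cri-o:// prefix from the ID)
ssh core@sno01271 sudo crictl logs <container-id>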
Version-Release number of selected component (if applicable):
4.10.13
How reproducible:
Around 0.04-0.09% of all installations (usually 1-2 out of ~2300) fail due to this bug
Steps to Reproduce:
1. Run 1000 SNO installations
2. One of them should suffer from this
Actual results:
- kube-controller-manager crashes
- kube-controller-manager doesn't get back up
- The logs for kube-controller-manager cannot be retrieved via `oc logs...`
- The cluster cannot complete installation because kube-controller-manager is down
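A quick way to confirm a cluster is stuck in this state (names taken from this report; this is just a sketch and the exact output will vary):

# Mirror pod sits at 3/4 READY, with kube-controller-manager the container that is down
oc get pods -n openshift-kube-controller-manager

# The kube-controller-manager ClusterOperator should not report healthy while the container is down
oc get clusteroperator kube-controller-manager

# The installation never reaches completion
oc get clusterversion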
Expected results:
- kube-controller-manager, like all other controllers, should have better tolerance for API downtime. There was an effort across OpenShift (with library-go) to avoid this kind of crash, but it looks like this condition was missed
- The container should have been restarted by kubelet
- The logs should still be available via `oc logs`
- The cluster should complete installation
Additional info:
If you need a live cluster with this bug reproduced, please reach out and it can be arranged.