Type: Bug
Resolution: Duplicate
Version: 4.10
Impact: Quality / Stability / Reliability
Severity: Moderate
Description of problem:
We've come across a fairly rare bug during mass installations of SNO clusters that prevents an installation from completing on its own.
The symptoms are as follows:
- The mirror pod for kube-controller-manager is stuck with 3/4 containers ready
- The container that's down is kube-controller-manager
- Running `oc logs` for that particular container fails with this error:
>[root@e24-h01-000-r640 omer]# oc logs -n openshift-kube-controller-manager kube-controller-manager-sno01271 kube-controller-manager
>Error from server (BadRequest): container "kube-controller-manager" in pod "kube-controller-manager-sno01271" is not available
>[root@e24-h01-000-r640 omer]# oc logs -n openshift-kube-controller-manager kube-controller-manager-sno01271 kube-controller-manager --previous
>Error from server (BadRequest): container "kube-controller-manager" in pod "kube-controller-manager-sno01271" is not available
Noting the cri-o://... ID of this container from the mirror pod's status, then SSHing into the node and running sudo crictl logs ... with that ID (see the command sketch at the end of this description), we see the culprit:
>E0505 20:36:46.587883 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api/v1/namespaces/kube-system/configmaps/kube-controller-manager?timeout=5s": dial tcp [fc00:1001::8de]:6443: connect: connection refused
>E0505 20:36:47.526268 1 namespaced_resources_deleter.go:161] unable to get all supported resources from server: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api": dial tcp [fc00:1001::8de]:6443: connect: connection refused
>F0505 20:36:47.526298 1 namespaced_resources_deleter.go:164] Unable to get any supported resources from server: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api": dial tcp [fc00:1001::8de]:6443: connect: connection refused
>goroutine 413 [running]:
>k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0x1)
>...
Here fc00:1001::8de is the IPv6 address of the single node (it's an IPv6 OVNKubernetes offline cluster, though I don't think that's a crucial detail). The fatal log comes from these lines [1].
The connection-refused part is not very surprising: since this is SNO, the API server fluctuates between being up and down, especially during the early stages of the installation.
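For convenience, here is a rough sketch of the commands used above; the pod/node names are the ones from this cluster, the jsonpath assumes the usual containerStatuses layout, and the node is assumed to be reachable as the core user:

# Grab the cri-o://... ID of the kube-controller-manager container from the mirror pod's status
oc get pod -n openshift-kube-controller-manager kube-controller-manager-sno01271 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="kube-controller-manager")].containerID}'

# SSH to the node and read the container logs straight from cri-o (drop the cri-o:// prefix from the ID)
ssh core@sno01271 sudo crictl logs <container-id>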
Version-Release number of selected component (if applicable):
4.10.13
How reproducible:
Around 0.04-0.09% of all installations (usually 1-2 out of ~2300) fail due to this bug
Steps to Reproduce:
1. Run 1000 SNO installations
2. One of them should suffer from this
Actual results:
- kube-controller-manager crashes
- kube-controller-manager doesn't get back up
- The logs for kube-controller-manager cannot be retrieved via `oc logs...`
- The cluster cannot complete installation because kube-controller-manager is down
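A quick way to confirm a cluster is stuck in this state (names taken from this report; this is just a sketch and the exact output will vary):

# Mirror pod sits at 3/4 READY, with kube-controller-manager the container that is down
oc get pods -n openshift-kube-controller-manager

# The kube-controller-manager ClusterOperator should not report healthy while the container is down
oc get clusteroperator kube-controller-manager

# The installation never reaches completion
oc get clusterversion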
Expected results:
- kube-controller-manager, like all other controllers, should have better tolerance for API downtime. There was an effort across OpenShift (with library-go) to avoid this kind of crash, but it looks like this condition was missed
- The container should have been restarted by kubelet
- The logs should still be available via `oc logs`
- The cluster should complete installation
Additional info:
If you need a live cluster with this bug reproduced, please reach out and it can be arranged.