OpenShift Bugs / OCPBUGS-9256

New revision of KCM static pod not created while the old revision is broken / has wrong status

      Description of problem:

      We've come across a rare bug during mass installations of single-node OpenShift (SNO) clusters that prevents the installation from completing on its own.

      The symptoms are as follows:

      • The mirror pod for kube-controller-manager is stuck with 3/4 containers ready
      • The container that's down is kube-controller-manager
      • oc logs ... for that particular container leads to this error:

      >[root@e24-h01-000-r640 omer]# oc logs -n openshift-kube-controller-manager kube-controller-manager-sno01271 kube-controller-manager
      >Error from server (BadRequest): container "kube-controller-manager" in pod "kube-controller-manager-sno01271" is not available
      >[root@e24-h01-000-r640 omer]# oc logs -n openshift-kube-controller-manager kube-controller-manager-sno01271 kube-controller-manager --previous
      >Error from server (BadRequest): container "kube-controller-manager" in pod "kube-controller-manager-sno01271" is not available

      Noting the cri-o://... ID of this container from the mirror pod's status, then SSHing into the node and running sudo crictl logs ... with that ID, we see the culprit:

      >E0505 20:36:46.587883 1 leaderelection.go:330] error retrieving resource lock kube-system/kube-controller-manager: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api/v1/namespaces/kube-system/configmaps/kube-controller-manager?timeout=5s": dial tcp [fc00:1001::8de]:6443: connect: connection refused
      >E0505 20:36:47.526268 1 namespaced_resources_deleter.go:161] unable to get all supported resources from server: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api": dial tcp [fc00:1001::8de]:6443: connect: connection refused
      >F0505 20:36:47.526298 1 namespaced_resources_deleter.go:164] Unable to get any supported resources from server: Get "https://api-int.sno01271.rdu2.scalelab.redhat.com:6443/api": dial tcp [fc00:1001::8de]:6443: connect: connection refused
      >goroutine 413 [running]:
      >k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0x1)
      >...

      Here fc00:1001::8de is the IPv6 address of the single node (it's an IPv6 OVNKubernetes offline cluster, although I don't think that's a crucial detail). The fatal log comes from these lines [1].
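
      For reference, the code path at [1] looks roughly like the sketch below (paraphrased, not an exact copy of the upstream source; names are illustrative): a partial discovery failure is only reported, but an empty discovery result is handled with klog.Fatalf, which exits the whole kube-controller-manager process.

      // Paraphrased sketch of the fatal path referenced in [1].
      package sketch

      import (
          "fmt"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          utilruntime "k8s.io/apimachinery/pkg/util/runtime"
          "k8s.io/klog/v2"
      )

      // initOpCache mirrors the deleter's start-up discovery step.
      func initOpCache(discoverResources func() ([]*metav1.APIResourceList, error)) {
          resources, err := discoverResources()
          if err != nil {
              // A partial discovery failure is only reported (the E0505 log line above).
              utilruntime.HandleError(fmt.Errorf("unable to get all supported resources from server: %v", err))
          }
          if len(resources) == 0 {
              // Discovering nothing at all is fatal (the F0505 log line above): klog.Fatalf
              // logs and then exits, taking the whole kube-controller-manager process with it.
              klog.Fatalf("Unable to get any supported resources from server: %v", err)
          }
          // opCache pre-fill elided.
      }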

      The connection refused part is not very surprising: this is an SNO, and its API server fluctuates between being up and down, especially during the early stages of the cluster.

      Version-Release number of selected component (if applicable):
      4.10.13

      How reproducible:
      Around 0.087% (usually 1-2 out of 2300) of all installations fail due to this bug

      Steps to Reproduce:
      1. Run 1000 SNO installations
      2. One of them should suffer from this

      Actual results:

      • kube-controller-manager crashes
      • kube-controller-manager doesn't get back up
      • The logs for kube-controller-manager cannot be retrieved via `oc logs...`
      • The cluster cannot complete installation because kube-controller-manager is down

      Expected results:

      • kube-controller-manager, like all other controllers, should have better tolerance for API downtime. There was an effort across OpenShift (with library-go) to avoid this kind of crash, but it looks like this condition was missed (see the sketch after this list)
      • The container should have been restarted by kubelet
      • The logs should still be available via oc logs.
      • The cluster should complete installation
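
      For illustration only, a minimal sketch of what "better tolerance" could look like: retry discovery with a bounded interval instead of exiting. The helper name and the 10-second interval are assumptions for this example; this is not the actual library-go mechanism.

      package sketch

      import (
          "context"
          "time"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/util/wait"
          "k8s.io/klog/v2"
      )

      // discoverWithRetry is a hypothetical helper: instead of treating an empty
      // discovery result as fatal, keep polling until the API server answers or
      // the context is cancelled.
      func discoverWithRetry(ctx context.Context, discover func() ([]*metav1.APIResourceList, error)) ([]*metav1.APIResourceList, error) {
          var resources []*metav1.APIResourceList
          err := wait.PollImmediateUntilWithContext(ctx, 10*time.Second, func(ctx context.Context) (bool, error) {
              var derr error
              resources, derr = discover()
              if derr != nil || len(resources) == 0 {
                  // On an SNO the API server may simply be down for a moment; log and retry.
                  klog.Warningf("discovery not ready yet, will retry: %v", derr)
                  return false, nil
              }
              return true, nil
          })
          return resources, err
      }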

      Additional info:
      If you need a live cluster with this bug reproduced, please reach out; it can be arranged.

      [1] https://github.com/openshift/kubernetes/blob/fe7796f337ea0d35bc3e6b5428d63685d1833cb5/pkg/controller/namespace/deletion/namespaced_resources_deleter.go#L159-L165

              Assignee: Ryan Phillips (rphillip@redhat.com)
              Reporter: Omer Tuchfeld (otuchfel@redhat.com)
              QA Contact: Sunil Choudhary