OpenShift Bugs / OCPBUGS-23063

Kube-apiserver is going into CLBO state while bootstrapping on ABI



      Description of problem:

      While doing an agent-based install, the cluster is taking too long (approximately 20 hours) to install.
      
      It is a 4.12.23 ABI install.
      
      Here are our observations:
      
      a) We observed that the OS is installed on all 3 servers, but only the master-0 (rendezvous host) node is accessible through SSH. The other nodes are pingable, though.
      
      b) In the failed state, the API VIP and Ingress VIP addresses were not configured on the main interface of the master-0 (rendezvous host) node.
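
      As a diagnostic sketch (the interface name and VIP values below are placeholders, not taken from this environment), the VIP assignment can be checked directly on the rendezvous host:

      # On master-0 (rendezvous host); <iface>, <api-vip> and <ingress-vip> are placeholders
      ip -4 addr show
      ip -4 addr show <iface> | grep -E '<api-vip>|<ingress-vip>'
      # On on-prem platforms the VIPs are normally held by keepalived; confirm its container is running
      sudo crictl ps -a | grep keepalived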
      
      c) We also observed in the boot-up journal log of master-0 (rendezvous host) that the bootstrap-kube-controller-manager pod failed to start with a CrashLoopBackOff error, as shown below:
      Oct 30 23:46:35 openshift-master-0 kubelet.sh[11122]: E1030 23:46:35.281509   11122 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 10s restarting failed container=kube-controller-manager pod=bootstrap-kube-controller-manager-openshift-master-0_kube-system(59bfa1a805b5bc1e621b485bee8944a6)\"" pod="kube-system/bootstrap-kube-controller-manager-openshift-master-0" podUID=59bfa1a805b5bc1e621b485bee8944a6
      Oct 30 23:46:36 openshift-master-0 kubelet.sh[11122]: E1030 23:46:36.283747   11122 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 10s restarting failed container=kube-controller-manager pod=bootstrap-kube-controller-manager-openshift-master-0_kube-system(59bfa1a805b5bc1e621b485bee8944a6)\"" pod="kube-system/bootstrap-kube-controller-manager-openshift-master-0" podUID=59bfa1a805b5bc1e621b485bee8944a6
      Oct 30 23:46:37 openshift-master-0 kubelet.sh[11122]: E1030 23:46:37.284910   11122 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 10s restarting failed container=kube-controller-manager pod=bootstrap-kube-controller-manager-openshift-master-0_kube-system(59bfa1a805b5bc1e621b485bee8944a6)\"" pod="kube-system/bootstrap-kube-controller-manager-openshift-master-0" podUID=59bfa1a805b5bc1e621b485bee8944a6
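
      To see the underlying failure rather than just the back-off messages, the crashing container's own logs can be pulled on the rendezvous host. For example (a sketch, not output from this system; the container ID is whatever the first command reports):

      # On master-0 (rendezvous host)
      sudo crictl ps -a | grep kube-controller-manager    # find the failing container ID
      sudo crictl logs --tail 100 <container-id>
      journalctl -b --no-pager | grep CrashLoopBackOff | tail -n 20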
      
      d) We also observed that the following pods continuously failed to start with CrashLoopBackOff errors on master-0 (rendezvous host):
      Oct 31 01:42:32 openshift-master-0 kubelet.sh[11122]: E1031 01:42:32.210639   11122 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-version-operator\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-version-operator pod=bootstrap-cluster-version-operator-openshift-master-0_openshift-cluster-version(1fe02b9e38781cc99a4ffe0e9086726b)\"" pod="openshift-cluster-version/bootstrap-cluster-version-operator-openshift-master-0" podUID=1fe02b9e38781cc99a4ffe0e9086726b
      Oct 31 01:42:33 openshift-master-0 kubelet.sh[11122]: E1031 01:42:33.211270   11122 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserver\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-apiserver pod=bootstrap-kube-apiserver-openshift-master-0_openshift-kube-apiserver(f43d4c9183dea070316db1eeec8ee359)\"" pod="openshift-kube-apiserver/bootstrap-kube-apiserver-openshift-master-0" podUID=f43d4c9183dea070316db1eeec8ee359
      Oct 31 01:42:44 openshift-master-0 kubelet.sh[11122]: E1031 01:42:44.211319   11122 pod_workers.go:965] "Error syncing pod, skipping" err="[failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-controller-manager pod=bootstrap-kube-controller-manager-openshift-master-0_kube-system(59bfa1a805b5bc1e621b485bee8944a6)\", failed to \"StartContainer\" for \"cluster-policy-controller\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-policy-controller pod=bootstrap-kube-controller-manager-openshift-master-0_kube-system(59bfa1a805b5bc1e621b485bee8944a6)\"]" pod="kube-system/bootstrap-kube-controller-manager-openshift-master-0" podUID=59bfa1a805b5bc1e621b485bee8944a6
      Oct 31 01:42:45 openshift-master-0 kubelet.sh[11122]: E1031 01:42:45.211215   11122 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-version-operator\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-version-operator pod=bootstrap-cluster-version-operator-openshift-master-0_openshift-cluster-version(1fe02b9e38781cc99a4ffe0e9086726b)\"" pod="openshift-cluster-version/bootstrap-cluster-version-operator-openshift-master-0" podUID=1fe02b9e38781cc99a4ffe0e9086726b
      Oct 31 01:42:56 openshift-master-0 kubelet.sh[11122]: E1031 01:42:56.211955   11122 pod_workers.go:965] "Error syncing pod, skipping" err="[failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-controller-manager pod=bootstrap-kube-controller-manager-openshift-master-0_kube-system(59bfa1a805b5bc1e621b485bee8944a6)\", failed to \"StartContainer\" for \"cluster-policy-controller\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=cluster-policy-controller pod=bootstrap-kube-controller-manager-openshift-master-0_kube-system(59bfa1a805b5bc1e621b485bee8944a6)\"]" pod="kube-system/bootstrap-kube-controller-manager-openshift-master-0" podUID=59bfa1a805b5bc1e621b485bee8944a6
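
      Since the bootstrap kube-apiserver is crash-looping as well, generic checks on the rendezvous host would be whether the bootstrap etcd container is running, whether anything is listening on the API and etcd ports, and what bootkube reports. These are illustrative checks only, not a confirmed root cause:

      # On master-0 (rendezvous host)
      sudo crictl ps -a | grep etcd                  # bootstrap etcd should be Running
      sudo ss -tlnp | grep -E ':6443|:2379'          # kube-apiserver / etcd listeners
      sudo crictl ps -a | grep kube-apiserver        # then: sudo crictl logs <container-id>
      journalctl -u bootkube.service --no-pager | tail -n 50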
      
      e) We have also observed previously that the cluster recovers on its own after approximately 16+ hours.

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      Always at the customer's end

      Steps to Reproduce:

      1. Create the agent-based ISO.
      2. Boot the systems with the ISO.
      3. Observe the installation, which eventually fails (see the example monitoring commands after this list).
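
      For step 3, progress is typically monitored from the workstation holding the install assets; a minimal example, where <assets-dir> is a placeholder for the directory used to generate the ISO:

      openshift-install agent wait-for bootstrap-complete --dir <assets-dir> --log-level=debug
      openshift-install agent wait-for install-complete --dir <assets-dir> --log-level=debug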
      

      Actual results:

      The installation takes 16+ hours to recover and complete, and even then the cluster is unstable.

      Expected results:

      The installation should not fail, and it should not take this much time.

      Additional info:

       

       
