OpenShift Bugs / OCPBUGS-3770

cvo pod crashloop during bootstrap: featuregates: connection refused

      Description of problem:

      The cluster-version-operator pod crashloops during the bootstrap process, which can lengthen bootstrap enough that the installer times out and fails.

      The cluster-version-operator pod restarts continuously because of a Go panic. The installer reports the bootstrap as failed due to the timeout, although the cluster does finish bootstrapping correctly once the cluster-version-operator pod eventually runs cleanly.
      
      $ oc -n openshift-cluster-version logs -p cluster-version-operator-754498df8b-5gll8
      I0919 10:25:05.790124       1 start.go:23] ClusterVersionOperator 4.12.0-202209161347.p0.gc4fd1f4.assembly.stream-c4fd1f4                                                                                                                    
      F0919 10:25:05.791580       1 start.go:29] error: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused                                                        
      goroutine 1 [running]:
      k8s.io/klog/v2.stacks(0x1)
              /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:860 +0x8a
      k8s.io/klog/v2.(*loggingT).output(0x2bee180, 0x3, 0x0, 0xc00017d5e0, 0x1, {0x22e9abc?, 0x1?}, 0x2beed80?, 0x0)
              /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:825 +0x686
      k8s.io/klog/v2.(*loggingT).printfDepth(0x2bee180, 0x0?, 0x0, {0x0, 0x0}, 0x1?, {0x1b9cff0, 0x9}, {0xc000089140, 0x1, ...})                                                                                                                   
              /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:630 +0x1f2
      k8s.io/klog/v2.(*loggingT).printf(...)
              /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:612
      k8s.io/klog/v2.Fatalf(...)
              /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1516
      main.init.3.func1(0xc00012ac80?, {0x1b96f60?, 0x6?, 0x6?})
              /go/src/github.com/openshift/cluster-version-operator/cmd/start.go:29 +0x1e6
      github.com/spf13/cobra.(*Command).execute(0xc00012ac80, {0xc0002fea20, 0x6, 0x6})
              /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663
      github.com/spf13/cobra.(*Command).ExecuteC(0x2bd52a0)
              /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
      github.com/spf13/cobra.(*Command).Execute(...)
              /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:902
      main.main()
              /go/src/github.com/openshift/cluster-version-operator/cmd/main.go:29 +0x46
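
      For context, the fatal above is raised from the CVO "start" command at cmd/start.go:29 via klog.Fatalf. The snippet below is a hypothetical reconstruction of that pattern for illustration only (not the actual CVO source): a single in-cluster FeatureGate GET whose error is handed straight to klog.Fatalf, which dumps the goroutine stack and exits with code 255, matching the Exit Code in the pod description further down.

      // Hypothetical reconstruction of the failing startup pattern, for illustration only.
      package main

      import (
              "context"

              configclient "github.com/openshift/client-go/config/clientset/versioned"
              metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
              "k8s.io/client-go/rest"
              "k8s.io/klog/v2"
      )

      func main() {
              // In-cluster config resolves the apiserver from KUBERNETES_SERVICE_HOST /
              // KUBERNETES_SERVICE_PORT, which the pod spec sets to 127.0.0.1:6443.
              cfg, err := rest.InClusterConfig()
              if err != nil {
                      klog.Fatalf("error: %v", err)
              }
              client := configclient.NewForConfigOrDie(cfg)

              // During bootstrap the local apiserver endpoint may not be serving yet, so
              // this GET can fail with "connection refused"; Fatalf then kills the process
              // immediately, which is what produces the CrashLoopBackOff shown below.
              if _, err := client.ConfigV1().FeatureGates().Get(context.TODO(), "cluster", metav1.GetOptions{}); err != nil {
                      klog.Fatalf("error: %v", err) // corresponds to the F0919 ... start.go:29 line above
              }
      }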

      Version-Release number of selected component (if applicable):

      4.12.0-0.nightly-2022-09-18-234318

      How reproducible:

      Most of the time, with any network type and installation type (IPI, UPI, and proxy).

      Steps to Reproduce:

      1. Install OCP 4.12 IPI
         $ openshift-install create cluster
      2. Wait until bootstrap is completed
      

      Actual results:

      [...]
      level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
      level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
      
      NAMESPACE                                          NAME                                                         READY   STATUS             RESTARTS        AGE 
      openshift-cluster-version                          cluster-version-operator-754498df8b-5gll8                    0/1     CrashLoopBackOff   7 (3m21s ago)   24m 
      openshift-image-registry                           image-registry-94fd8b75c-djbxb                               0/1     Pending            0               6m44s 
      openshift-image-registry                           image-registry-94fd8b75c-ft66c                               0/1     Pending            0               6m44s 
      openshift-ingress                                  router-default-64fbb749b4-cmqgw                              0/1     Pending            0               13m   
      openshift-ingress                                  router-default-64fbb749b4-mhtqx                              0/1     Pending            0               13m   
      openshift-monitoring                               prometheus-operator-admission-webhook-6d8cb95cf7-6jn5q       0/1     Pending            0               14m 
      openshift-monitoring                               prometheus-operator-admission-webhook-6d8cb95cf7-r6nnk       0/1     Pending            0               14m 
      openshift-network-diagnostics                      network-check-source-8758bd6fc-vzf5k                         0/1     Pending            0               18m 
      openshift-operator-lifecycle-manager               collect-profiles-27726375-hlq89                              0/1     Pending            0               21m 
      $ oc -n openshift-cluster-version describe pod cluster-version-operator-754498df8b-5gll8
      Name:                 cluster-version-operator-754498df8b-5gll8
      Namespace:            openshift-cluster-version                                                            
      Priority:             2000000000              
      Priority Class Name:  system-cluster-critical                                                       
      Node:                 ostest-4gtwr-master-1/10.196.0.68
      Start Time:           Mon, 19 Sep 2022 10:17:41 +0000                       
      Labels:               k8s-app=cluster-version-operator
                            pod-template-hash=754498df8b
      Annotations:          openshift.io/scc: hostaccess 
      Status:               Running                      
      IP:                   10.196.0.68
      IPs:                 
        IP:           10.196.0.68
      Controlled By:  ReplicaSet/cluster-version-operator-754498df8b
      Containers:        
        cluster-version-operator:
          Container ID:  cri-o://1e2879600c89baabaca68c1d4d0a563d4b664c507f0617988cbf9ea7437f0b27
          Image:         registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69                                                                                                             
          Image ID:      registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69
          Port:          <none>                                                                                                                                                                                                                    
          Host Port:     <none>                                                                                                                                                                                                                    
          Args:                                                     
            start                                                                                                                                                                                                                                  
            --release-image=registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69                                                                                                          
            --enable-auto-update=false                                                                                                                                                                                                             
            --listen=0.0.0.0:9099                                                  
            --serving-cert-file=/etc/tls/serving-cert/tls.crt
            --serving-key-file=/etc/tls/serving-cert/tls.key                                                                                                                                                                                       
            --v=2             
          State:       Waiting 
            Reason:    CrashLoopBackOff
          Last State:  Terminated
            Reason:    Error
            Message:   I0919 10:33:07.798614       1 start.go:23] ClusterVersionOperator 4.12.0-202209161347.p0.gc4fd1f4.assembly.stream-c4fd1f4
      F0919 10:33:07.800115       1 start.go:29] error: Get "https://127.0.0.1:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp 127.0.0.1:6443: connect: connection refused
      goroutine 1 [running]:
      k8s.io/klog/v2.stacks(0x1)
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:860 +0x8a
      k8s.io/klog/v2.(*loggingT).output(0x2bee180, 0x3, 0x0, 0xc000433ea0, 0x1, {0x22e9abc?, 0x1?}, 0x2beed80?, 0x0)
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:825 +0x686
      k8s.io/klog/v2.(*loggingT).printfDepth(0x2bee180, 0x0?, 0x0, {0x0, 0x0}, 0x1?, {0x1b9cff0, 0x9}, {0xc0002d6630, 0x1, ...})
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:630 +0x1f2
      k8s.io/klog/v2.(*loggingT).printf(...)
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:612
      k8s.io/klog/v2.Fatalf(...)
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1516
      main.init.3.func1(0xc0003b4f00?, {0x1b96f60?, 0x6?, 0x6?})
        /go/src/github.com/openshift/cluster-version-operator/cmd/start.go:29 +0x1e6
      github.com/spf13/cobra.(*Command).execute(0xc0003b4f00, {0xc000311980, 0x6, 0x6})
        /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:860 +0x663
      github.com/spf13/cobra.(*Command).ExecuteC(0x2bd52a0)
        /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:974 +0x3b4
      github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:902
      main.main()
        /go/src/github.com/openshift/cluster-version-operator/cmd/main.go:29 +0x46
            Exit Code:    255
            Started:      Mon, 19 Sep 2022 10:33:07 +0000
            Finished:     Mon, 19 Sep 2022 10:33:07 +0000
          Ready:          False
          Restart Count:  7
          Requests:
            cpu:     20m
            memory:  50Mi
          Environment:
            KUBERNETES_SERVICE_PORT:  6443
            KUBERNETES_SERVICE_HOST:  127.0.0.1
            NODE_NAME:                 (v1:spec.nodeName)
            CLUSTER_PROFILE:          self-managed-high-availability
          Mounts:
            /etc/cvo/updatepayloads from etc-cvo-updatepayloads (ro)
            /etc/ssl/certs from etc-ssl-certs (ro)
            /etc/tls/service-ca from service-ca (ro)
            /etc/tls/serving-cert from serving-cert (ro)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access (ro)
      Conditions:
        Type              Status
        Initialized       True
        Ready             False
        ContainersReady   False
        PodScheduled      True
      Volumes:
        etc-ssl-certs:
          Type:          HostPath (bare host directory volume)
          Path:          /etc/ssl/certs
          HostPathType:
        etc-cvo-updatepayloads:
          Type:          HostPath (bare host directory volume)
          Path:          /etc/cvo/updatepayloads
          HostPathType:
        serving-cert:
          Type:        Secret (a volume populated by a Secret)
          SecretName:  cluster-version-operator-serving-cert
          Optional:    false
        service-ca:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      openshift-service-ca.crt
          Optional:  false
        kube-api-access:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3600
          ConfigMapName:           kube-root-ca.crt
          ConfigMapOptional:       <nil>
          DownwardAPI:             true
      QoS Class:                   Burstable
      Node-Selectors:              node-role.kubernetes.io/master=
      Tolerations:                 node-role.kubernetes.io/master:NoSchedule op=Exists
                                   node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                                   node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
      Events:
        Type     Reason            Age                   From               Message
        ----     ------            ----                  ----               -------
        Warning  FailedScheduling  25m                   default-scheduler  no nodes available to schedule pods
        Warning  FailedScheduling  21m                   default-scheduler  0/2 nodes are available: 2 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
        Normal   Scheduled         19m                   default-scheduler  Successfully assigned openshift-cluster-version/cluster-version-operator-754498df8b-5gll8 to ostest-4gtwr-master-1 by ostest-4gtwr-bootstrap
        Warning  FailedMount       17m                   kubelet            Unable to attach or mount volumes: unmounted volumes=[serving-cert], unattached volumes=[service-ca kube-api-access etc-ssl-certs etc-cvo-updatepayloads serving-cert]: timed out waiting for the condition
        Warning  FailedMount       17m (x9 over 19m)     kubelet            MountVolume.SetUp failed for volume "serving-cert" : secret "cluster-version-operator-serving-cert" not found
        Normal   Pulling           15m                   kubelet            Pulling image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69"
        Normal   Pulled            15m                   kubelet            Successfully pulled image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69" in 7.481824271s
        Normal   Started           14m (x3 over 15m)     kubelet            Started container cluster-version-operator
        Normal   Created           14m (x4 over 15m)     kubelet            Created container cluster-version-operator
        Normal   Pulled            14m (x3 over 15m)     kubelet            Container image "registry.ci.openshift.org/ocp/release@sha256:2e38cd73b402a990286829aebdf00aa67a5b99124c61ec2f4fccd1135a1f0c69" already present on machine
        Warning  BackOff           4m22s (x52 over 15m)  kubelet            Back-off restarting failed container
        
        

      Expected results:

      No panic. The CVO should tolerate the API server at 127.0.0.1:6443 being temporarily unreachable during bootstrap, for example by retrying the FeatureGate lookup, instead of crashing.
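
      As a rough sketch of that expectation (illustrative only, not a proposed patch; the interval and timeout values are made up), the startup lookup could poll until the apiserver answers instead of exiting on the first error:

      // Sketch of the expected behavior: retry the FeatureGate lookup with a bounded
      // poll rather than exiting on the first "connection refused" during bootstrap.
      package main

      import (
              "context"
              "time"

              configv1 "github.com/openshift/api/config/v1"
              configclient "github.com/openshift/client-go/config/clientset/versioned"
              metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
              "k8s.io/apimachinery/pkg/util/wait"
              "k8s.io/client-go/rest"
              "k8s.io/klog/v2"
      )

      func main() {
              cfg, err := rest.InClusterConfig()
              if err != nil {
                      klog.Fatalf("error: %v", err)
              }
              client := configclient.NewForConfigOrDie(cfg)

              var fg *configv1.FeatureGate
              // Tolerate the apiserver being down while bootstrap is still in progress:
              // retry every 5s for up to 5 minutes before giving up (values are illustrative).
              err = wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
                      got, getErr := client.ConfigV1().FeatureGates().Get(context.TODO(), "cluster", metav1.GetOptions{})
                      if getErr != nil {
                              klog.V(2).Infof("featuregates not available yet, retrying: %v", getErr)
                              return false, nil // keep polling instead of crashing
                      }
                      fg = got
                      return true, nil
              })
              if err != nil {
                      klog.Fatalf("error: featuregates never became available: %v", err)
              }
              klog.Infof("featuregates: %q", fg.Spec.FeatureSet)
      }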

      Additional info:

      Seen in most of the OCP-on-OSP (OpenStack) QE CI jobs.

      Attached [^must-gather-install.tar.gz]
