OpenShift Bugs / OCPBUGS-161

SNO fails to complete install because "Unable to apply 4.11.0: the cluster operator openshift-controller-manager has not yet successfully rolled out"

    • Quality / Stability / Reliability
    • Important
    • 12/19: Removing from the 4.12 gating list based on test results.
      12/15: Moved to rank 2 since current test results (on 4.11.13) have not reproduced this issue; planning to run more test iterations, and will remove this issue from the 4.12 list altogether if those also don't reproduce it.
      12/12: Telco Scale team evaluating the latest test run and gearing up to run with the proposed fix.
      12/8: This will not make 4.12 but will be a release note. Node team has a proposed fix; Telco QE will collaborate to test the fix.
      12/5: Node team engaging upstream on this.
      Release note for Telco: Not required; nothing to say now, as the issue was not reproducible.
    • Rejected
    • PM Sync
      Description of problem:

      When installing 1000+ SNOs via ACM/MCE using GitOps ZTP, a small percentage of clusters never complete installation because the openshift-controller-manager operator does not reconcile to Available.

      Example:

      oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig get co openshift-controller-manager
      NAME                           VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      openshift-controller-manager             False       True          False      17h     Available: no daemon pods available on any node. 
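For fleet-scale triage, the same check can be scripted as a small filter over the standard `oc get co` table. This is a hedged sketch, not part of the report: the awk logic has to account for the fact that the VERSION column is empty for an operator that never rolled out, which shifts the AVAILABLE value from field 3 to field 2.

```shell
# Print the names of cluster operators whose AVAILABLE column is False.
# Reads 'oc get co' tabular output on stdin. The VERSION column is empty
# for an operator that never rolled out, so AVAILABLE may appear in
# field 2 instead of field 3; we detect which field holds the boolean.
unavailable_operators() {
  awk 'NR > 1 {
    avail = ($2 == "True" || $2 == "False") ? $2 : $3
    if (avail == "False") print $1
  }'
}

# Example usage against one cluster (kubeconfig path from this report):
# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig get co | unavailable_operators
```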

       

      Version-Release number of selected component (if applicable):

      Hub OCP and SNO OCP - 4.11.0

      ACM - 2.6.0-DOWNSTREAM-2022-08-11-23-41-09  (FC5)

       

      How reproducible:

      • 21 of the 23 failed installs (out of 1728 total) hit this issue
      • ~90% of the failures are due to this issue
      • overall failure rate from this issue is ~1.2% of total installs
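The two percentages follow directly from the raw counts above; as a quick sanity check of the arithmetic:

```shell
# Sanity-check the reported rates: 21 of 23 failures were this issue,
# out of 1728 total installs.
awk 'BEGIN {
  this_issue = 21; failures = 23; installs = 1728
  printf "share of failures: %.0f%%\n", 100 * this_issue / failures
  printf "overall failure rate: %.1f%%\n", 100 * this_issue / installs
}'
```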

       

      Additional info:

       

       

      # oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       True          18h     Unable to apply 4.11.0: the cluster operator openshift-controller-manager has not yet successfully rolled out

      # oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig get co
      NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.11.0    True        False         False      10m     
      baremetal                                  4.11.0    True        False         False      17h     
      cloud-controller-manager                   4.11.0    True        False         False      17h     
      cloud-credential                           4.11.0    True        False         False      17h     
      cluster-autoscaler                         4.11.0    True        False         False      17h     
      config-operator                            4.11.0    True        False         False      18h     
      console                                    4.11.0    True        False         False      16h     
      csi-snapshot-controller                    4.11.0    True        False         False      17h     
      dns                                        4.11.0    True        False         False      17h     
      etcd                                       4.11.0    True        False         False      18h     
      image-registry                             4.11.0    True        False         False      17h     
      ingress                                    4.11.0    True        False         False      17h     
      insights                                   4.11.0    True        False         False      51s     
      kube-apiserver                             4.11.0    True        False         False      17h     
      kube-controller-manager                    4.11.0    True        False         False      17h     
      kube-scheduler                             4.11.0    True        False         False      17h     
      kube-storage-version-migrator              4.11.0    True        False         False      17h     
      machine-api                                4.11.0    True        False         False      17h     
      machine-approver                           4.11.0    True        False         False      17h     
      machine-config                             4.11.0    True        False         False      17h     
      marketplace                                4.11.0    True        False         False      17h     
      monitoring                                 4.11.0    True        False         False      17h     
      network                                    4.11.0    True        False         False      18h     
      node-tuning                                4.11.0    True        False         False      17h     
      openshift-apiserver                        4.11.0    True        False         False      11m     
      openshift-controller-manager                         False       True          False      18h     Available: no daemon pods available on any node.
      openshift-samples                          4.11.0    True        False         False      17h     
      operator-lifecycle-manager                 4.11.0    True        False         False      17h     
      operator-lifecycle-manager-catalog         4.11.0    True        False         False      17h     
      operator-lifecycle-manager-packageserver   4.11.0    True        False         False      17h     
      service-ca                                 4.11.0    True        False         False      18h     
      storage                                    4.11.0    True        False         False      17h
      
      # oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig get po -n openshift-controller-manager
      NAME                       READY   STATUS        RESTARTS   AGE
      controller-manager-26stq   0/1     Terminating   0          17h

      # oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig describe po -n openshift-controller-manager
      Name:                      controller-manager-26stq
      Namespace:                 openshift-controller-manager
      Priority:                  2000001000
      Priority Class Name:       system-node-critical
      Node:                      sno00090/fc00:1001::441
      Start Time:                Mon, 15 Aug 2022 20:40:49 +0000
      Labels:                    app=openshift-controller-manager
                                 controller-manager=true
                                 controller-revision-hash=5cdd6cbbf4
                                 pod-template-generation=3
      Annotations:               k8s.ovn.org/pod-networks:
                                   {"default":{"ip_addresses":["fd01:0:0:1::26/64"],"mac_address":"0a:58:4c:a7:02:14","gateway_ips":["fd01:0:0:1::1"],"ip_address":"fd01:0:0:...
                                 k8s.v1.cni.cncf.io/network-status:
                                   [{
                                       "name": "ovn-kubernetes",
                                       "interface": "eth0",
                                       "ips": [
                                           "fd01:0:0:1::26"
                                       ],
                                       "mac": "0a:58:4c:a7:02:14",
                                       "default": true,
                                       "dns": {}
                                   }]
                                 k8s.v1.cni.cncf.io/networks-status:
                                   [{
                                       "name": "ovn-kubernetes",
                                       "interface": "eth0",
                                       "ips": [
                                           "fd01:0:0:1::26"
                                       ],
                                       "mac": "0a:58:4c:a7:02:14",
                                       "default": true,
                                       "dns": {}
                                   }]
                                 openshift.io/scc: restricted-v2
                                 operator.openshift.io/force: a45bf93a-5310-494f-9b47-efe5e122e439
                                 seccomp.security.alpha.kubernetes.io/pod: runtime/default
      Status:                    Terminating (lasts 17h)
      Termination Grace Period:  30s
      IP:                        fd01:0:0:1::26
      IPs:
        IP:           fd01:0:0:1::26
      Controlled By:  DaemonSet/controller-manager
      Containers:
        controller-manager:
          Container ID:
          Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fead5da1351abee9e77f5c46247c7947d6f9cf76b5bf7a8b4905222fe2564665
          Image ID:
          Port:          8443/TCP
          Host Port:     0/TCP
          Command:
            openshift-controller-manager
            start
          Args:
            --config=/var/run/configmaps/config/config.yaml
            -v=2
          State:          Terminated
            Reason:       ContainerStatusUnknown
            Message:      The container could not be located when the pod was terminated
            Exit Code:    137
            Started:      Mon, 01 Jan 0001 00:00:00 +0000
            Finished:     Mon, 01 Jan 0001 00:00:00 +0000
          Ready:          False
          Restart Count:  0
          Requests:
            cpu:        100m
            memory:     100Mi
          Environment:  <none>
          Mounts:
            /etc/pki/ca-trust/extracted/pem from proxy-ca-bundles (rw)
            /var/run/configmaps/client-ca from client-ca (rw)
            /var/run/configmaps/config from config (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-d6rhr (ro)
            /var/run/secrets/serving-cert from serving-cert (rw)
      Conditions:
        Type              Status
        Initialized       True
        Ready             False
        ContainersReady   False
        PodScheduled      True
      Volumes:
        config:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      config
          Optional:  false
        client-ca:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      client-ca
          Optional:  false
        serving-cert:
          Type:        Secret (a volume populated by a Secret)
          SecretName:  serving-cert
          Optional:    false
        proxy-ca-bundles:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      openshift-global-ca
          Optional:  false
        kube-api-access-d6rhr:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3607
          ConfigMapName:           kube-root-ca.crt
          ConfigMapOptional:       <nil>
          DownwardAPI:             true
          ConfigMapName:           openshift-service-ca.crt
          ConfigMapOptional:       <nil>
      QoS Class:                   Burstable
      Node-Selectors:              node-role.kubernetes.io/master=
      Tolerations:                 op=Exists
      Events:
        Type     Reason       Age                    From     Message
        ----     ------       ----                   ----     -------
        Warning  FailedMount  81m (x493 over 17h)    kubelet  MountVolume.SetUp failed for volume "proxy-ca-bundles" : object "openshift-controller-manager"/"openshift-global-ca" not registered
        Warning  FailedMount  63m (x502 over 17h)    kubelet  MountVolume.SetUp failed for volume "kube-api-access-d6rhr" : [object "openshift-controller-manager"/"kube-root-ca.crt" not registered, object "openshift-controller-manager"/"openshift-service-ca.crt" not registered]
        Warning  FailedMount  16m (x525 over 17h)    kubelet  MountVolume.SetUp failed for volume "client-ca" : object "openshift-controller-manager"/"client-ca" not registered
        Warning  FailedMount  12m (x527 over 17h)    kubelet  MountVolume.SetUp failed for volume "serving-cert" : object "openshift-controller-manager"/"serving-cert" not registered
        Warning  FailedMount  2m34s (x532 over 17h)  kubelet  MountVolume.SetUp failed for volume "config" : object "openshift-controller-manager"/"config" not registered 
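All of the FailedMount events above share the same "object ... not registered" kubelet signature, which makes an affected node easy to confirm from its journal. A hedged sketch (the node name and ssh access follow this report's setup; the helper itself is just a line counter over journal output):

```shell
# Count kubelet "not registered" mount failures in journal output fed
# on stdin. Example (node name from this report):
#   ssh core@sno00090 sudo journalctl -u kubelet | count_not_registered
count_not_registered() {
  grep -c 'not registered'
}
```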
      
      # oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig logs -n openshift-controller-manager controller-manager-26stq
      Error from server (BadRequest): container "controller-manager" in pod "controller-manager-26stq" is terminated
      
      

       

      Restarting the kubelet resolves the issue:

      # ssh core@sno00090 sudo systemctl restart kubelet 
      # oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig get clusterversion -w
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       True          18h     Unable to apply 4.11.0: the cluster operator openshift-controller-manager has not yet successfully rolled out
      version             False       True          18h     Working towards 4.11.0: 801 of 802 done (99% complete)
      version   4.11.0    True        False         0s      Cluster version is 4.11.0
      

       

       

              sgrunert@redhat.com Sascha Grunert
              akrzos@redhat.com Alex Krzos