  Red Hat Advanced Cluster Management / ACM-12880

ImagePullBackOff error on klusterlet during cluster registration


    • Type: Bug
    • Resolution: Done
    • Priority: Blocker
    • Affects Version: MCE 2.6.1
    • Components: DevOps, Server Foundation
    • Severity: Critical

      Description of problem:

      After creating a new Hive cluster on MCE 2.6.1, the cluster stays stuck in the Importing state.
      build: 2.6.1-DOWNANDBACK-2024-07-19-15-41-16

      We see the following error in the import controller log on the MCE hub:

      2024-07-23T12:46:05.438218058Z	INFO	manifestwork-controller	Reconciling the manifest works of the managed cluster	{"Request.Name": "clc-az-1721664639215"}
      2024-07-23T12:46:05.501413934Z	ERROR	Reconciler error	{"controller": "manifestwork-controller", "namespace": "", "name": "clc-az-1721664639215", "reconcileID": "962bed92-9a7f-41d0-8b4a-053a79d00609", "error": "manifestworks.work.open-cluster-management.io \"clc-az-1721664639215-klusterlet-crds\" already exists", "errorCauses": [{"error": "manifestworks.work.open-cluster-management.io \"clc-az-1721664639215-klusterlet-crds\" already exists"}]}
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
      	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:329
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
      	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
      	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227
      2024-07-23T12:46:05.501523021Z	INFO	manifestwork-controller	Reconciling the manifest works of the managed cluster	{"Request.Name": "clc-az-1721664639215"}
      2024-07-23T12:46:05.508745918Z	INFO	importconfig-controller	Reconciling managed cluster	{"Request.Name": "clc-az-1721664639215"}
      2024-07-23T12:46:05.522439217Z	INFO	manifestwork-controller	Reconciling the manifest works of the managed cluster	{"Request.Name": "clc-az-1721664639215"}
      I0723 12:46:05.551293       1 event.go:364] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"multicluster-engine", Name:"managedcluster-import-controller-v2", UID:"c325d438-2f26-4a92-973f-51743d98fb94", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretUpdated' Updated Secret/bootstrap-hub-kubeconfig -n open-cluster-management-agent because it changed 
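
      The conflicting ManifestWork can be inspected on the hub. A minimal sketch, assuming the cluster name from the log above and that the resource still exists in that state:

      # On the MCE hub: list the manifest works in the managed cluster namespace
      oc get manifestwork -n clc-az-1721664639215

      # Inspect the klusterlet-crds manifest work the controller reports as already existing
      oc get manifestwork clc-az-1721664639215-klusterlet-crds -n clc-az-1721664639215 -o yaml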

      On the managed cluster, the klusterlet pod is failing to start with ImagePullBackOff:

      oc get pods -n open-cluster-management-agent
      NAME                          READY   STATUS             RESTARTS   AGE
      klusterlet-54b6cc6bcd-x67vp   0/1     ImagePullBackOff   0          19h 

      Pod description:

      Name:             klusterlet-54b6cc6bcd-x67vp
      Namespace:        open-cluster-management-agent
      Priority:         0
      Service Account:  klusterlet
      Node:             clc-az-1721664639215-zkm5f-worker-eastus3-qmph8/10.0.128.4
      Start Time:       Mon, 22 Jul 2024 10:09:58 -0700
      Labels:           app=klusterlet
                        pod-template-hash=54b6cc6bcd
      Annotations:      k8s.ovn.org/pod-networks:
                          {"default":{"ip_addresses":["10.133.2.13/23"],"mac_address":"0a:58:0a:85:02:0d","gateway_ips":["10.133.2.1"],"routes":[{"dest":"10.132.0.0...
                        k8s.v1.cni.cncf.io/network-status:
                          [{
                              "name": "ovn-kubernetes",
                              "interface": "eth0",
                              "ips": [
                                  "10.133.2.13"
                              ],
                              "mac": "0a:58:0a:85:02:0d",
                              "default": true,
                              "dns": {}
                          }]
                        openshift.io/scc: restricted-v2
                        seccomp.security.alpha.kubernetes.io/pod: runtime/default
      Status:           Pending
      SeccompProfile:   RuntimeDefault
      IP:               10.133.2.13
      IPs:
        IP:           10.133.2.13
      Controlled By:  ReplicaSet/klusterlet-54b6cc6bcd
      Containers:
        klusterlet:
          Container ID:
          Image:         registry.redhat.io/multicluster-engine/registration-operator-rhel9@sha256:ace853fde03f1d417522cd47385f6fb78c82bd0a7aa2a7a3fb305c997896dedb
          Image ID:
          Port:          <none>
          Host Port:     <none>
          Args:
            /registration-operator
            klusterlet
            --disable-leader-election
          State:          Waiting
            Reason:       ImagePullBackOff
          Ready:          False
          Restart Count:  0
          Limits:
            memory:  2Gi
          Requests:
            cpu:      50m
            memory:   64Mi
          Liveness:   http-get https://:8443/healthz delay=2s timeout=1s period=10s #success=1 #failure=3
          Readiness:  http-get https://:8443/healthz delay=2s timeout=1s period=10s #success=1 #failure=3
          Environment:
            POD_NAME:  klusterlet-54b6cc6bcd-x67vp (v1:metadata.name)
          Mounts:
            /tmp from tmpdir (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-45b7q (ro)
      Conditions:
        Type                        Status
        PodReadyToStartContainers   True
        Initialized                 True
        Ready                       False
        ContainersReady             False
        PodScheduled                True
      Volumes:
        tmpdir:
          Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
          Medium:
          SizeLimit:  <unset>
        kube-api-access-45b7q:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3607
          ConfigMapName:           kube-root-ca.crt
          ConfigMapOptional:       <nil>
          DownwardAPI:             true
          ConfigMapName:           openshift-service-ca.crt
          ConfigMapOptional:       <nil>
      QoS Class:                   Burstable
      Node-Selectors:              <none>
      Tolerations:                 node-role.kubernetes.io/infra:NoSchedule op=Exists
                                   node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                   node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
      Events:
        Type     Reason   Age                   From     Message
        ----     ------   ----                  ----     -------
        Warning  Failed   175m (x203 over 19h)  kubelet  Failed to pull image "registry.redhat.io/multicluster-engine/registration-operator-rhel9@sha256:ace853fde03f1d417522cd47385f6fb78c82bd0a7aa2a7a3fb305c997896dedb": reading manifest sha256:ace853fde03f1d417522cd47385f6fb78c82bd0a7aa2a7a3fb305c997896dedb in registry.redhat.io/multicluster-engine/registration-operator-rhel9: manifest unknown
        Normal   BackOff  31s (x5240 over 19h)  kubelet  Back-off pulling image "registry.redhat.io/multicluster-engine/registration-operator-rhel9@sha256:ace853fde03f1d417522cd47385f6fb78c82bd0a7aa2a7a3fb305c997896dedb" 
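
      The "manifest unknown" error suggests that the digest referenced by the klusterlet deployment does not exist in registry.redhat.io. A rough way to verify this, assuming skopeo and valid registry credentials are available (the pull-secret.json path below is a placeholder):

      # Which image does the klusterlet operator deployment reference on the managed cluster?
      oc get deployment klusterlet -n open-cluster-management-agent \
        -o jsonpath='{.spec.template.spec.containers[0].image}'

      # Ask the registry whether that digest exists; for this build it fails with "manifest unknown"
      skopeo inspect --authfile pull-secret.json \
        docker://registry.redhat.io/multicluster-engine/registration-operator-rhel9@sha256:ace853fde03f1d417522cd47385f6fb78c82bd0a7aa2a7a3fb305c997896dedb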

      Version-Release number of selected component (if applicable):

      MCE 2.6.1

      How reproducible:

      Steps to Reproduce:

      1. Create a Hive cluster on MCE 2.6.1.
      2. Observe that the cluster gets stuck in the Importing state (a hub-side check is sketched below).
      3. ...
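
      A minimal sketch of the hub-side check for step 2, assuming the cluster name from the log above:

      # Review the ManagedCluster conditions; a stuck import typically shows
      # ManagedClusterImportSucceeded / ManagedClusterConditionAvailable not yet True
      oc get managedcluster clc-az-1721664639215 -o yaml

      # Import controller logs on the hub
      oc logs -n multicluster-engine deployment/managedcluster-import-controller-v2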

      Actual results:

      Expected results:

      Additional info:

              Jian Zhu (jiazhu@redhat.com)
              David Huynh (rhn-support-dhuynh)
              Hui Chen
              ACM QE Team