Type: Bug
Priority: Blocker
Severity: Critical
Affects Version: MCE 2.6.1
Resolution: Done
Description of problem:
After creating a new Hive cluster on MCE 2.6.1, the cluster stays stuck in the Importing state.
build: 2.6.1-DOWNANDBACK-2024-07-19-15-41-16
We see the following error in the import controller log on the MCE hub:
2024-07-23T12:46:05.438218058Z INFO manifestwork-controller Reconciling the manifest works of the managed cluster {"Request.Name": "clc-az-1721664639215"}
2024-07-23T12:46:05.501413934Z ERROR Reconciler error {"controller": "manifestwork-controller", "namespace": "", "name": "clc-az-1721664639215", "reconcileID": "962bed92-9a7f-41d0-8b4a-053a79d00609", "error": "manifestworks.work.open-cluster-management.io \"clc-az-1721664639215-klusterlet-crds\" already exists", "errorCauses": [{"error": "manifestworks.work.open-cluster-management.io \"clc-az-1721664639215-klusterlet-crds\" already exists"}]}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227
2024-07-23T12:46:05.501523021Z INFO manifestwork-controller Reconciling the manifest works of the managed cluster {"Request.Name": "clc-az-1721664639215"}
2024-07-23T12:46:05.508745918Z INFO importconfig-controller Reconciling managed cluster {"Request.Name": "clc-az-1721664639215"}
2024-07-23T12:46:05.522439217Z INFO manifestwork-controller Reconciling the manifest works of the managed cluster {"Request.Name": "clc-az-1721664639215"}
I0723 12:46:05.551293 1 event.go:364] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"multicluster-engine", Name:"managedcluster-import-controller-v2", UID:"c325d438-2f26-4a92-973f-51743d98fb94", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretUpdated' Updated Secret/bootstrap-hub-kubeconfig -n open-cluster-management-agent because it changed
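The "already exists" error suggests the controller is colliding with a leftover ManifestWork from an earlier import attempt. As a diagnostic sketch (the namespace matching the managed cluster name is the usual convention), the conflicting object can be inspected on the hub with:

oc get manifestwork -n clc-az-1721664639215
oc get manifestwork clc-az-1721664639215-klusterlet-crds -n clc-az-1721664639215 -o yaml

Comparing the creationTimestamp and ownerReferences of the -klusterlet-crds ManifestWork against the current import attempt should show whether it is stale.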
On the managed cluster, the klusterlet pod cannot start because its image fails to pull:
oc get pods -n open-cluster-management-agent
NAME                          READY   STATUS             RESTARTS   AGE
klusterlet-54b6cc6bcd-x67vp   0/1     ImagePullBackOff   0          19h
Output of oc describe pod:
Name:             klusterlet-54b6cc6bcd-x67vp
Namespace:        open-cluster-management-agent
Priority:         0
Service Account:  klusterlet
Node:             clc-az-1721664639215-zkm5f-worker-eastus3-qmph8/10.0.128.4
Start Time:       Mon, 22 Jul 2024 10:09:58 -0700
Labels:           app=klusterlet
                  pod-template-hash=54b6cc6bcd
Annotations:      k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["10.133.2.13/23"],"mac_address":"0a:58:0a:85:02:0d","gateway_ips":["10.133.2.1"],"routes":[{"dest":"10.132.0.0...
                  k8s.v1.cni.cncf.io/network-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "10.133.2.13" ], "mac": "0a:58:0a:85:02:0d", "default": true, "dns": {} }]
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Pending
SeccompProfile:   RuntimeDefault
IP:               10.133.2.13
IPs:
  IP:  10.133.2.13
Controlled By:  ReplicaSet/klusterlet-54b6cc6bcd
Containers:
  klusterlet:
    Container ID:
    Image:         registry.redhat.io/multicluster-engine/registration-operator-rhel9@sha256:ace853fde03f1d417522cd47385f6fb78c82bd0a7aa2a7a3fb305c997896dedb
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      /registration-operator
      klusterlet
      --disable-leader-election
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  2Gi
    Requests:
      cpu:     50m
      memory:  64Mi
    Liveness:   http-get https://:8443/healthz delay=2s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get https://:8443/healthz delay=2s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:  klusterlet-54b6cc6bcd-x67vp (v1:metadata.name)
    Mounts:
      /tmp from tmpdir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-45b7q (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  tmpdir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-45b7q:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/infra:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Warning  Failed   175m (x203 over 19h)  kubelet  Failed to pull image "registry.redhat.io/multicluster-engine/registration-operator-rhel9@sha256:ace853fde03f1d417522cd47385f6fb78c82bd0a7aa2a7a3fb305c997896dedb": reading manifest sha256:ace853fde03f1d417522cd47385f6fb78c82bd0a7aa2a7a3fb305c997896dedb in registry.redhat.io/multicluster-engine/registration-operator-rhel9: manifest unknown
  Normal   BackOff  31s (x5240 over 19h)  kubelet  Back-off pulling image "registry.redhat.io/multicluster-engine/registration-operator-rhel9@sha256:ace853fde03f1d417522cd47385f6fb78c82bd0a7aa2a7a3fb305c997896dedb"
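The "manifest unknown" in the kubelet event means the registry does not serve the pinned digest at all, so this is not a node-side pull-secret problem. As a quick check from any machine with access to registry.redhat.io (skopeo is an assumption here; authentication to the registry may be required, e.g. via skopeo login), something like:

skopeo inspect docker://registry.redhat.io/multicluster-engine/registration-operator-rhel9@sha256:ace853fde03f1d417522cd47385f6fb78c82bd0a7aa2a7a3fb305c997896dedb

A "manifest unknown" response would confirm the digest referenced by the klusterlet deployment was never published (or was removed), matching the kubelet error above.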
Version-Release number of selected component (if applicable):
MCE 2.6.1
How reproducible:
Steps to Reproduce:
- Create a Hive cluster on an MCE 2.6.1 hub.
- Observe that the cluster gets stuck in the Importing state (see the check below).
- ...
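To confirm the stuck state from the hub, the managed cluster's registration conditions can be checked with something like the following (the cluster name from this reproduction is assumed):

oc get managedcluster clc-az-1721664639215 -o yaml

While the klusterlet pod stays in ImagePullBackOff, the expectation is that the ManagedClusterJoined and ManagedClusterConditionAvailable conditions never become True.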