Bug
Resolution: Done
Blocker
MCE 2.6.1
Critical
Description of problem:
After creating a new Hive cluster on MCE 2.6.1, the cluster stays stuck in the Importing status.
build: 2.6.1-DOWNANDBACK-2024-07-19-15-41-16
We see the following error in the import controller log on the MCE hub:
2024-07-23T12:46:05.438218058Z INFO manifestwork-controller Reconciling the manifest works of the managed cluster {"Request.Name": "clc-az-1721664639215"}
2024-07-23T12:46:05.501413934Z ERROR Reconciler error {"controller": "manifestwork-controller", "namespace": "", "name": "clc-az-1721664639215", "reconcileID": "962bed92-9a7f-41d0-8b4a-053a79d00609", "error": "manifestworks.work.open-cluster-management.io \"clc-az-1721664639215-klusterlet-crds\" already exists", "errorCauses": [{"error": "manifestworks.work.open-cluster-management.io \"clc-az-1721664639215-klusterlet-crds\" already exists"}]}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227
2024-07-23T12:46:05.501523021Z INFO manifestwork-controller Reconciling the manifest works of the managed cluster {"Request.Name": "clc-az-1721664639215"}
2024-07-23T12:46:05.508745918Z INFO importconfig-controller Reconciling managed cluster {"Request.Name": "clc-az-1721664639215"}
2024-07-23T12:46:05.522439217Z INFO manifestwork-controller Reconciling the manifest works of the managed cluster {"Request.Name": "clc-az-1721664639215"}
I0723 12:46:05.551293 1 event.go:364] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"multicluster-engine", Name:"managedcluster-import-controller-v2", UID:"c325d438-2f26-4a92-973f-51743d98fb94", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretUpdated' Updated Secret/bootstrap-hub-kubeconfig -n open-cluster-management-agent because it changed
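For context, the manifestwork that the controller reports as already existing can be inspected directly on the hub. A minimal check, assuming the managed cluster namespace matches the cluster name from the log above:
# On the MCE hub: list manifestworks in the managed cluster's namespace
oc get manifestwork -n clc-az-1721664639215
# Dump the klusterlet CRDs manifestwork the controller keeps failing to (re)create
oc get manifestwork clc-az-1721664639215-klusterlet-crds -n clc-az-1721664639215 -o yaml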
On the managed cluster, I can see the klusterlet pod failing with ImagePullBackOff:
oc get pods -n open-cluster-management-agent
NAME                          READY   STATUS             RESTARTS   AGE
klusterlet-54b6cc6bcd-x67vp   0/1     ImagePullBackOff   0          19h
Pod description:
Name: klusterlet-54b6cc6bcd-x67vp
Namespace: open-cluster-management-agent
Priority: 0
Service Account: klusterlet
Node: clc-az-1721664639215-zkm5f-worker-eastus3-qmph8/10.0.128.4
Start Time: Mon, 22 Jul 2024 10:09:58 -0700
Labels: app=klusterlet
pod-template-hash=54b6cc6bcd
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["10.133.2.13/23"],"mac_address":"0a:58:0a:85:02:0d","gateway_ips":["10.133.2.1"],"routes":[{"dest":"10.132.0.0...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.133.2.13"
],
"mac": "0a:58:0a:85:02:0d",
"default": true,
"dns": {}
}]
openshift.io/scc: restricted-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Pending
SeccompProfile: RuntimeDefault
IP: 10.133.2.13
IPs:
IP: 10.133.2.13
Controlled By: ReplicaSet/klusterlet-54b6cc6bcd
Containers:
klusterlet:
Container ID:
Image: registry.redhat.io/multicluster-engine/registration-operator-rhel9@sha256:ace853fde03f1d417522cd47385f6fb78c82bd0a7aa2a7a3fb305c997896dedb
Image ID:
Port: <none>
Host Port: <none>
Args:
/registration-operator
klusterlet
--disable-leader-election
State: Waiting
Reason: ImagePullBackOff
Ready: False
Restart Count: 0
Limits:
memory: 2Gi
Requests:
cpu: 50m
memory: 64Mi
Liveness: http-get https://:8443/healthz delay=2s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get https://:8443/healthz delay=2s timeout=1s period=10s #success=1 #failure=3
Environment:
POD_NAME: klusterlet-54b6cc6bcd-x67vp (v1:metadata.name)
Mounts:
/tmp from tmpdir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-45b7q (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
tmpdir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-45b7q:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/infra:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Failed 175m (x203 over 19h) kubelet Failed to pull image "registry.redhat.io/multicluster-engine/registration-operator-rhel9@sha256:ace853fde03f1d417522cd47385f6fb78c82bd0a7aa2a7a3fb305c997896dedb": reading manifest sha256:ace853fde03f1d417522cd47385f6fb78c82bd0a7aa2a7a3fb305c997896dedb in registry.redhat.io/multicluster-engine/registration-operator-rhel9: manifest unknown
Normal BackOff 31s (x5240 over 19h) kubelet Back-off pulling image "registry.redhat.io/multicluster-engine/registration-operator-rhel9@sha256:ace853fde03f1d417522cd47385f6fb78c82bd0a7aa2a7a3fb305c997896dedb"
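The "manifest unknown" error indicates the referenced digest cannot be resolved in the registry. One way to verify, assuming skopeo is available and you are logged in to registry.redhat.io (e.g. via skopeo login or --authfile):
# Check whether the digest exists in the registry (fails with "manifest unknown" if it does not)
skopeo inspect docker://registry.redhat.io/multicluster-engine/registration-operator-rhel9@sha256:ace853fde03f1d417522cd47385f6fb78c82bd0a7aa2a7a3fb305c997896dedb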
Version-Release number of selected component (if applicable):
MCE 2.6.1
How reproducible:
Steps to Reproduce:
- Create a Hive cluster on MCE 2.6.1
- Observe the cluster gets stuck in Importing status (a check is sketched below)
- ...
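As referenced in the steps above, one hedged way to confirm the stuck import from the hub (the cluster name is the one from this report and will differ per environment):
# Check the managed cluster's availability and joined state
oc get managedcluster clc-az-1721664639215
# Inspect the import-related conditions on the ManagedCluster resource
oc describe managedcluster clc-az-1721664639215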