Bug | Resolution: Done | Critical | ACM 2.12.0 | SF Train-20
Description of problem:
While deploying 297 SNOs via the Image Based Installer using ClusterInstances (with the siteconfig operator), 2 clusters never initialize because their namespaces are deleted by the system:serviceaccount:multicluster-engine:managedcluster-import-controller-v2 account before the clusters can begin deploying.
While attempting to determine why these clusters were not deploying, I observed their namespaces being put into the Terminating state very early in the test (within seconds of GitOps applying all of the clusters to be deployed). I then searched the kube-apiserver audit logs to find what was triggering the delete.
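For reference, a quick way to spot cluster namespaces entering Terminating during a run (a minimal sketch, not necessarily the exact command used in this test):

$ oc get ns -o json | jq -r '.items[] | select(.status.phase=="Terminating") | .metadata.name'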
I found the following delete entry for vm00027, and a similar entry for vm00291:
$ grep vm00027 e38-h03-kube-apiserver-audit/audit.log | jq 'select(.verb=="delete")'
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "996d25bc-af1c-46da-8723-4b9fc49b470f",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/vm00027",
  "verb": "delete",
  "user": {
    "username": "system:serviceaccount:multicluster-engine:managedcluster-import-controller-v2",
    "uid": "0b83233c-4c82-497d-8ef1-6fd02c80c3f8",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:multicluster-engine",
      "system:authenticated"
    ],
    "extra": {
      "authentication.kubernetes.io/credential-id": [
        "JTI=b3d5e2fb-5389-4667-8d7e-1b9b6476d9a1"
      ],
      "authentication.kubernetes.io/node-name": [
        "e38-h02-000-r650"
      ],
      "authentication.kubernetes.io/node-uid": [
        "fd9a8657-1cee-4023-8038-6db39ecd6a5c"
      ],
      "authentication.kubernetes.io/pod-name": [
        "managedcluster-import-controller-v2-7bf5b6fcfc-9w27b"
      ],
      "authentication.kubernetes.io/pod-uid": [
        "ba97458a-8cce-4548-919a-776e62492ee6"
      ]
    }
  },
  "sourceIPs": [
    "fc00:1005::5"
  ],
  "userAgent": "managedcluster-import-controller/v0.0.0 (linux/amd64) kubernetes/$Format",
  "objectRef": {
    "resource": "namespaces",
    "namespace": "vm00027",
    "name": "vm00027",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 200
  },
  "requestReceivedTimestamp": "2024-10-01T14:09:17.124438Z",
  "stageTimestamp": "2024-10-01T14:09:17.127734Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"open-cluster-management:server-foundation:managedcluster-import-controller-v2\" of ClusterRole \"open-cluster-management:server-foundation:managedcluster-import-controller-v2\" to ServiceAccount \"managedcluster-import-controller-v2/multicluster-engine\""
  }
}
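For illustration, the same audit log can be used to enumerate every namespace delete issued by that service account and count how many clusters were hit. This is a sketch built from the fields visible in the event above (verb, objectRef.resource, user.username); the log path is the one from the small-scale environment:

$ jq -r 'select(.verb=="delete"
          and .objectRef.resource=="namespaces"
          and .user.username=="system:serviceaccount:multicluster-engine:managedcluster-import-controller-v2")
        | .objectRef.name' \
    e38-h03-kube-apiserver-audit/audit.log | sort -u | wc -l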
This was just what I pulled from the small-scale environment; a similar test in the large environment found 67 out of 3672 clusters missing, or about 1.8% of the total.
I believe GitOps is retrying the clusters and, to some degree, self-healing the issue, which masks its full extent. In a separate test without GitOps, just oc applying IBI manifests for 3672 SNOs (at a rate of 500/hr), I observed 346 missing clusters, or 9.4% of the total, that were never deployed because of a delete issued shortly after they were applied.
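A rough way to quantify the missing clusters after a run like this is to compare the cluster names that were applied against the namespaces that still exist on the hub. The sketch below assumes a hypothetical manifests/ directory containing one <cluster>.yaml per SNO; adjust the loop to however the ClusterInstance manifests are actually laid out:

existing=$(oc get ns -o jsonpath='{.items[*].metadata.name}')
for c in $(ls manifests/ | sed 's/\.yaml$//'); do
  # report any applied cluster whose namespace no longer exists
  echo "$existing" | tr ' ' '\n' | grep -qx "$c" || echo "missing: $c"
done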
This leads me to believe there is a race condition in the logic of the "clusternamespacedeletion-controller" when clusters are being provisioned via IBI, which causes many cluster namespaces to end up Terminating and be deleted before the clusters can ever deploy.
Version-Release number of selected component (if applicable):
OCP 4.17.0 (Both hub and spokes)
ACM - 2.12.0-DOWNSTREAM-2024-09-27-14-56-41
How reproducible:
Steps to Reproduce:
- ...