(Issue fields: Bug, Resolution: Done, Critical, ACM 2.12.0, SF Train-20)
Description of problem:
While deploying 297 SNOs via the Image Based Installer using ClusterInstances (with the siteconfig operator), 2 clusters never even initialize because their namespaces are deleted by the system:serviceaccount:multicluster-engine:managedcluster-import-controller-v2 service account before deployment can begin.
While investigating why these clusters were not deploying, I observed their namespaces entering the Terminating state very early in the test (within seconds of GitOps applying all of the clusters).
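(For reference, the affected namespaces can be listed with a generic query on the namespace phase; nothing in this command is specific to this environment.)
$ oc get ns -o json | jq -r '.items[] | select(.status.phase=="Terminating") | .metadata.name'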
I then searched the kube-apiserver audit logs to find what was triggering the deletes and found the following delete entry for vm00027, along with a similar entry for vm00291:
$ grep vm00027 e38-h03-kube-apiserver-audit/audit.log | jq ' select(.verb=="delete")'
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "996d25bc-af1c-46da-8723-4b9fc49b470f",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/vm00027",
  "verb": "delete",
  "user": {
    "username": "system:serviceaccount:multicluster-engine:managedcluster-import-controller-v2",
    "uid": "0b83233c-4c82-497d-8ef1-6fd02c80c3f8",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:multicluster-engine",
      "system:authenticated"
    ],
    "extra": {
      "authentication.kubernetes.io/credential-id": [
        "JTI=b3d5e2fb-5389-4667-8d7e-1b9b6476d9a1"
      ],
      "authentication.kubernetes.io/node-name": [
        "e38-h02-000-r650"
      ],
      "authentication.kubernetes.io/node-uid": [
        "fd9a8657-1cee-4023-8038-6db39ecd6a5c"
      ],
      "authentication.kubernetes.io/pod-name": [
        "managedcluster-import-controller-v2-7bf5b6fcfc-9w27b"
      ],
      "authentication.kubernetes.io/pod-uid": [
        "ba97458a-8cce-4548-919a-776e62492ee6"
      ]
    }
  },
  "sourceIPs": [
    "fc00:1005::5"
  ],
  "userAgent": "managedcluster-import-controller/v0.0.0 (linux/amd64) kubernetes/$Format",
  "objectRef": {
    "resource": "namespaces",
    "namespace": "vm00027",
    "name": "vm00027",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 200
  },
  "requestReceivedTimestamp": "2024-10-01T14:09:17.124438Z",
  "stageTimestamp": "2024-10-01T14:09:17.127734Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"open-cluster-management:server-foundation:managedcluster-import-controller-v2\" of ClusterRole \"open-cluster-management:server-foundation:managedcluster-import-controller-v2\" to ServiceAccount \"managedcluster-import-controller-v2/multicluster-engine\""
  }
}
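(For completeness, a jq query along these lines, using the same field names seen in the entry above, lists every namespace delete issued by that service account; the audit log path is the one from my environment.)
$ jq -r 'select(.verb=="delete"
        and .objectRef.resource=="namespaces"
        and .user.username=="system:serviceaccount:multicluster-engine:managedcluster-import-controller-v2")
        | "\(.requestReceivedTimestamp) \(.objectRef.name)"' \
    e38-h03-kube-apiserver-audit/audit.log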
This is just what I pulled from the small-scale environment; a similar test in the large environment found 67 out of 3672 clusters missing, or about 1.8% of the total.
I believe GitOps is retrying the clusters and to some degree self-healing the issue, which masks its full extent. In a separate test without GitOps, simply oc applying the IBI manifests for 3672 SNOs (500/hr), I observed 346 missing clusters, or 9.4%, that were never deployed because of a delete right after the manifests were applied.
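(For context, the missing-cluster counts come from comparing the set of cluster names that were applied against the namespaces actually present on the hub; a rough check along these lines reproduces the count, where clusters.txt, one expected cluster name per line, is a stand-in for however the expected list is kept.)
# expected names that are not present as namespaces on the hub
$ comm -23 <(sort clusters.txt) <(oc get ns -o name | sed 's|namespace/||' | sort) | wc -l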
This leads me to believe there is a race condition in the logic of the "clusternamespacedeletion-controller" when clusters are being provisioned via IBI, causing many cluster namespaces to end up in Terminating and be deleted before the clusters can ever deploy.
Version-Release number of selected component (if applicable):
OCP 4.17.0 (Both hub and spokes)
ACM - 2.12.0-DOWNSTREAM-2024-09-27-14-56-41
How reproducible:
Intermittent at scale: ~1.8% of clusters (67/3672) affected when deployed via GitOps, ~9.4% (346/3672) when applying the IBI manifests directly.
Steps to Reproduce:
- ...