Red Hat Advanced Cluster Management / ACM-14616

Clusters deleted before IBIO can begin provisioning

    • 3
    • False
    • None
    • False
    • 3
    • SF Train-20
    • Critical
    • None

      Description of problem:

      While deploying 297 SNOs via the Image Based Installer using clusterinstances (with the siteconfig operator), 2 clusters fail to even initialize because they are deleted by the system:serviceaccount:multicluster-engine:managedcluster-import-controller-v2 service account before they can begin deploying.

      While trying to determine why these clusters were not deploying, I observed their namespaces being put into the Terminating state very early in the test (perhaps seconds after GitOps applied all clusters for deployment). I then searched the kube-apiserver audit logs to find what was triggering the delete.

      I found the following delete entry for vm00027 and a similar entry for vm00291:

      $ grep vm00027 e38-h03-kube-apiserver-audit/audit.log | jq ' select(.verb=="delete")'  
      {
        "kind": "Event",
        "apiVersion": "audit.k8s.io/v1",
        "level": "Metadata",
        "auditID": "996d25bc-af1c-46da-8723-4b9fc49b470f",
        "stage": "ResponseComplete",
        "requestURI": "/api/v1/namespaces/vm00027",
        "verb": "delete",
        "user": {
          "username": "system:serviceaccount:multicluster-engine:managedcluster-import-controller-v2",
          "uid": "0b83233c-4c82-497d-8ef1-6fd02c80c3f8",
          "groups": [
            "system:serviceaccounts",
            "system:serviceaccounts:multicluster-engine",
            "system:authenticated"
          ],
          "extra": {
            "authentication.kubernetes.io/credential-id": [
              "JTI=b3d5e2fb-5389-4667-8d7e-1b9b6476d9a1"
            ],
            "authentication.kubernetes.io/node-name": [
              "e38-h02-000-r650"
            ],
            "authentication.kubernetes.io/node-uid": [
              "fd9a8657-1cee-4023-8038-6db39ecd6a5c"
            ],
            "authentication.kubernetes.io/pod-name": [
              "managedcluster-import-controller-v2-7bf5b6fcfc-9w27b"
            ],
            "authentication.kubernetes.io/pod-uid": [
              "ba97458a-8cce-4548-919a-776e62492ee6"
            ]
          }
        },
        "sourceIPs": [
          "fc00:1005::5"
        ],
        "userAgent": "managedcluster-import-controller/v0.0.0 (linux/amd64) kubernetes/$Format",
        "objectRef": {
          "resource": "namespaces",
          "namespace": "vm00027",
          "name": "vm00027",
          "apiVersion": "v1"
        },
        "responseStatus": {
          "metadata": {},
          "code": 200
        },
        "requestReceivedTimestamp": "2024-10-01T14:09:17.124438Z",
        "stageTimestamp": "2024-10-01T14:09:17.127734Z",
        "annotations": {
          "authorization.k8s.io/decision": "allow",
          "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"open-cluster-management:server-foundation:managedcluster-import-controller-v2\" of ClusterRole \"open-cluster-management:server-foundation:managedcluster-import-controller-v2\" to ServiceAccount \"managedcluster-import-controller-v2/multicluster-engine\""
        }
      }
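
      To count how many namespaces were hit, the grep/jq search above can be generalized to every namespace at once. The following is a minimal sketch, assuming the audit log is stored as one JSON event per line; the function name namespace_deletes is hypothetical, and the field names are taken from the event shown above.

```python
import json
from typing import Iterable, Iterator

# Service account seen in the audit entry above.
IMPORT_CONTROLLER = (
    "system:serviceaccount:multicluster-engine:managedcluster-import-controller-v2"
)


def namespace_deletes(
    lines: Iterable[str], username: str = IMPORT_CONTROLLER
) -> Iterator[dict]:
    """Yield a summary of each completed namespace delete issued by `username`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(event, dict):
            continue
        ref = event.get("objectRef") or {}
        user = event.get("user") or {}
        if (
            event.get("verb") == "delete"
            and event.get("stage") == "ResponseComplete"
            and ref.get("resource") == "namespaces"
            and user.get("username") == username
        ):
            yield {
                "namespace": ref.get("name"),
                "time": event.get("requestReceivedTimestamp"),
                "code": (event.get("responseStatus") or {}).get("code"),
            }


# Example usage (the path is the one used in the grep command above):
# with open("e38-h03-kube-apiserver-audit/audit.log") as f:
#     for hit in namespace_deletes(f):
#         print(hit)
```

      Filtering on the ResponseComplete stage avoids counting the same request more than once if the audit policy also records earlier stages.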

      This is only what I pulled from the small-scale environment. A similar test in the large environment found 67 out of 3672 clusters missing, or about 1.8% of the total.

      I believe GitOps is retrying the clusters and, to some degree, self-healing the issue, which hides its full extent. In a separate test without GitOps, applying the IBI manifests for 3672 SNOs directly with oc (500/1hr), I observed 346 missing clusters: 9.4% of the clusters were never deployed because they were deleted right after being applied.
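
      For reference, the failure rates quoted in this report can be reproduced from the raw counts. This is only a worked arithmetic sketch; the helper name missing_rate is hypothetical.

```python
def missing_rate(missing: int, total: int) -> float:
    """Return the share of clusters that never deployed, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * missing / total


small_scale = missing_rate(2, 297)      # small-scale environment, with GitOps
gitops_rate = missing_rate(67, 3672)    # large environment, with GitOps
oc_apply_rate = missing_rate(346, 3672)  # large environment, plain oc apply

print(f"small scale: {small_scale:.1f}%")   # small scale: 0.7%
print(f"GitOps:      {gitops_rate:.1f}%")   # GitOps:      1.8%
print(f"oc apply:    {oc_apply_rate:.1f}%")  # oc apply:    9.4%
print(f"oc apply vs GitOps: {oc_apply_rate / gitops_rate:.1f}x")  # 5.2x
```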

      This leads me to believe there is a race condition between the logic of "clusternamespacedeletion-controller" and cluster provisioning from IBI, which causes many clusters' namespaces to end up in Terminating and be deleted before they can ever deploy.
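
      One way to test the race-condition hypothesis would be to measure how long each namespace existed before it was deleted. The sketch below is an assumption-laden illustration, not the controller's logic: it treats the earliest audit event that references a namespace as a proxy for when the cluster was applied, assumes one JSON audit event per line, and uses a hypothetical function name, seconds_before_delete.

```python
import json
from datetime import datetime
from typing import Dict, Iterable


def _parse_ts(value: str) -> datetime:
    # Audit timestamps look like 2024-10-01T14:09:17.124438Z; map the "Z"
    # suffix to an explicit offset so older Python versions can parse it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def seconds_before_delete(lines: Iterable[str]) -> Dict[str, float]:
    """Map each deleted namespace to the seconds between the first audit
    event that references it and the first delete of the namespace itself."""
    first_seen: Dict[str, datetime] = {}
    deleted_at: Dict[str, datetime] = {}
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank, truncated, or non-JSON lines
        if not isinstance(event, dict):
            continue
        ts = event.get("requestReceivedTimestamp")
        ref = event.get("objectRef") or {}
        is_namespace = ref.get("resource") == "namespaces"
        # For namespace objects use the object name; otherwise use the
        # namespace the object lives in.
        ns = ref.get("name") if is_namespace else ref.get("namespace")
        if not ts or not ns:
            continue
        when = _parse_ts(ts)
        if ns not in first_seen or when < first_seen[ns]:
            first_seen[ns] = when
        if is_namespace and event.get("verb") == "delete":
            if ns not in deleted_at or when < deleted_at[ns]:
                deleted_at[ns] = when
    return {
        ns: (deleted_at[ns] - first_seen[ns]).total_seconds()
        for ns in deleted_at
    }
```

      If the hypothesis holds, the affected namespaces (for example vm00027 and vm00291) should show gaps of only a few seconds, while namespaces removed by normal cleanup should show much longer lifetimes.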

      Version-Release number of selected component (if applicable):

      OCP 4.17.0 (Both hub and spokes)

      ACM - 2.12.0-DOWNSTREAM-2024-09-27-14-56-41

      How reproducible:

      Steps to Reproduce:

      1.  
      2.  
      3. ...

      Actual results:

      Expected results:

      Additional info:

              jiazhu@redhat.com Jian Zhu
              akrzos@redhat.com Alex Krzos
              Hui Chen