Red Hat Advanced Cluster Management / ACM-14616

Clusters deleted before IBIO can begin provisioning



      Description of problem:

      While deploying 297 SNOs via the Image Based Installer using ClusterInstances (with the SiteConfig operator), 2 clusters fail to even initialize because they are deleted by the system:serviceaccount:multicluster-engine:managedcluster-import-controller-v2 service account before they can begin deploying.

      While attempting to determine why these clusters were not deploying, I observed the namespaces being put into a terminating state very early in the test (perhaps seconds after all clusters were applied by GitOps for deployment). I then searched the kube-apiserver audit logs to find what was triggering the delete.
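
      For reference, namespaces entering the terminating state can be spotted live during the test with something like the following (a sketch; it assumes the SNO cluster namespaces are the vmNNNNN names used in this environment):

      # list namespaces currently stuck in Terminating, limited to the vm* cluster namespaces
      $ oc get namespaces -o json \
          | jq -r '.items[] | select(.status.phase=="Terminating") | .metadata.name' \
          | grep '^vm'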

      I found the following delete entry for vm00027 (and a similar entry for vm00291):

      $ grep vm00027 e38-h03-kube-apiserver-audit/audit.log | jq ' select(.verb=="delete")'  
      {
        "kind": "Event",
        "apiVersion": "audit.k8s.io/v1",
        "level": "Metadata",
        "auditID": "996d25bc-af1c-46da-8723-4b9fc49b470f",
        "stage": "ResponseComplete",
        "requestURI": "/api/v1/namespaces/vm00027",
        "verb": "delete",
        "user": {
          "username": "system:serviceaccount:multicluster-engine:managedcluster-import-controller-v2",
          "uid": "0b83233c-4c82-497d-8ef1-6fd02c80c3f8",
          "groups": [
            "system:serviceaccounts",
            "system:serviceaccounts:multicluster-engine",
            "system:authenticated"
          ],
          "extra": {
            "authentication.kubernetes.io/credential-id": [
              "JTI=b3d5e2fb-5389-4667-8d7e-1b9b6476d9a1"
            ],
            "authentication.kubernetes.io/node-name": [
              "e38-h02-000-r650"
            ],
            "authentication.kubernetes.io/node-uid": [
              "fd9a8657-1cee-4023-8038-6db39ecd6a5c"
            ],
            "authentication.kubernetes.io/pod-name": [
              "managedcluster-import-controller-v2-7bf5b6fcfc-9w27b"
            ],
            "authentication.kubernetes.io/pod-uid": [
              "ba97458a-8cce-4548-919a-776e62492ee6"
            ]
          }
        },
        "sourceIPs": [
          "fc00:1005::5"
        ],
        "userAgent": "managedcluster-import-controller/v0.0.0 (linux/amd64) kubernetes/$Format",
        "objectRef": {
          "resource": "namespaces",
          "namespace": "vm00027",
          "name": "vm00027",
          "apiVersion": "v1"
        },
        "responseStatus": {
          "metadata": {},
          "code": 200
        },
        "requestReceivedTimestamp": "2024-10-01T14:09:17.124438Z",
        "stageTimestamp": "2024-10-01T14:09:17.127734Z",
        "annotations": {
          "authorization.k8s.io/decision": "allow",
          "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"open-cluster-management:server-foundation:managedcluster-import-controller-v2\" of ClusterRole \"open-cluster-management:server-foundation:managedcluster-import-controller-v2\" to ServiceAccount \"managedcluster-import-controller-v2/multicluster-engine\""
        }
      }

      This is just what I pulled from the small-scale environment; a similar test in the large environment found 67 out of 3672 clusters missing, or about 1.8% of the total.

      I believe GitOps is retrying the clusters and to some degree self-healing the issue, which hides its full extent. In a separate test without GitOps, just oc apply-ing IBI manifests for 3672 SNOs (500/hr), I observed 346 missing clusters, or 9.4%, that were never deployed because of a delete issued right after they were applied.
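
      A rough way to count how many of the applied clusters never got (or lost) their namespace, assuming a hypothetical file expected-clusters.txt listing one expected cluster name per line:

      # namespaces that currently exist on the hub
      $ oc get namespaces -o name | sed 's|^namespace/||' | sort > present.txt
      # names that were expected but are not present = clusters deleted or never created
      $ sort expected-clusters.txt > expected.txt
      $ comm -23 expected.txt present.txt | wc -l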

      This leads me to believe there is a race condition between the logic of the "clusternamespacedeletion-controller" and cluster provisioning via IBI, which causes many clusters' namespaces to end up terminating and be deleted before the clusters can ever deploy.
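
      To quantify this on a given hub, the same audit logs can be filtered for every namespace delete issued by that service account (a sketch reusing the log path from the entry above; the timestamps can then be correlated with when each cluster was applied):

      $ jq -r 'select(.verb=="delete"
                and .objectRef.resource=="namespaces"
                and .user.username=="system:serviceaccount:multicluster-engine:managedcluster-import-controller-v2")
              | "\(.requestReceivedTimestamp) \(.objectRef.name)"' \
          e38-h03-kube-apiserver-audit/audit.log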

      Version-Release number of selected component (if applicable):

      OCP 4.17.0 (Both hub and spokes)

      ACM - 2.12.0-DOWNSTREAM-2024-09-27-14-56-41

      How reproducible:

      Steps to Reproduce:

      1.  
      2.  
      3. ...

      Actual results:

      Expected results:

      Additional info:

      Attachments:

        1. e38-h02-kube-apiserver-audit.tar.gz (130.53 MB) - Alex Krzos
        2. e38-h03-kube-apiserver-audit.tar.gz (146.80 MB) - Alex Krzos
        3. e38-h06-kube-apiserver-audit.tar.gz (23.58 MB) - Alex Krzos
        4. hub-acm-must-gather.tar.gz (123.98 MB) - Alex Krzos
        5. hub-must-gather.tar.gz (145.18 MB) - Alex Krzos
        6. managedcluster-import-controller-v2-7bf5b6fcfc-9w27b.log.gz (755 kB) - Alex Krzos
