Type: Bug
Resolution: Done
Priority: Undefined
Affects Version: ACM 2.6.4
Description of problem:
While deploying ~2500 SNOs with ACM 2.6.4 for ACM upgrade testing, the managedcluster-import-controller-v2 pods are repeatedly OOM-killed, which prevents further clusters from being imported once the scale reaches ~2400 clusters. Likely we just need the same solution as in https://issues.redhat.com/browse/ACM-2275 backported to 2.6. A possible interim workaround is sketched below.
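Assuming the ACM-2275 fix amounts to a higher memory limit on the controller (not confirmed here), an interim workaround on an affected hub would be to raise the limit by hand. This is a sketch only: the 4Gi value is illustrative, and the deployment is reconciled by the multicluster-engine operator, which may revert manual edits.
# oc set resources deployment managedcluster-import-controller-v2 -n multicluster-engine --limits=memory=4Gi
# oc get deploy -n multicluster-engine managedcluster-import-controller-v2 -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'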
Version-Release number of selected component (if applicable):
ACM - 2.6.4-DOWNSTREAM-2023-01-31-18-35-03
OCP Hub 4.12.1, SNOs 4.12.1
How reproducible:
Steps to Reproduce:
- ...
Actual results:
Expected results:
Additional info:
# oc get po -n multicluster-engine -l app=managedcluster-import-controller-v2
NAME                                                  READY   STATUS    RESTARTS         AGE
managedcluster-import-controller-v2-6d4bdb4d8-4xm4r   1/1     Running   79 (4m42s ago)   16h
managedcluster-import-controller-v2-6d4bdb4d8-pw6pw   1/1     Running   80 (4m4s ago)    16h

# oc get managedcluster --no-headers | grep -v Unknown | grep True -c
2413

# oc get deploy -n multicluster-engine managedcluster-import-controller-v2 -o json | jq '.spec.template.spec.containers[0].resources'
{
  "limits": {
    "cpu": "500m",
    "memory": "2Gi"
  },
  "requests": {
    "cpu": "50m",
    "memory": "96Mi"
  }
}
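For reference, live memory usage against the 2Gi limit can be sampled while imports are running, assuming the hub's metrics stack is available:
# oc adm top pod -n multicluster-engine -l app=managedcluster-import-controller-v2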
A describe on one of the pods showing OOM:
# oc describe po -n multicluster-engine managedcluster-import-controller-v2-6d4bdb4d8-4xm4r
Name:             managedcluster-import-controller-v2-6d4bdb4d8-4xm4r
Namespace:        multicluster-engine
Priority:         0
Service Account:  managedcluster-import-controller-v2
Node:             e27-h05-000-r650/fc00:1002::7
Start Time:       Mon, 13 Feb 2023 21:34:43 +0000
Labels:           app=managedcluster-import-controller-v2
                  ocm-antiaffinity-selector=managedclusterimport
                  pod-template-hash=6d4bdb4d8
Annotations:      k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["fd01:0:0:1::3c/64"],"mac_address":"0a:58:df:cf:6d:db","gateway_ips":["fd01:0:0:1::1"],"ip_address":"fd01:0:0:...
                  k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "ovn-kubernetes",
                        "interface": "eth0",
                        "ips": [
                            "fd01:0:0:1::3c"
                        ],
                        "mac": "0a:58:df:cf:6d:db",
                        "default": true,
                        "dns": {}
                    }]
                  k8s.v1.cni.cncf.io/networks-status:
                    [{
                        "name": "ovn-kubernetes",
                        "interface": "eth0",
                        "ips": [
                            "fd01:0:0:1::3c"
                        ],
                        "mac": "0a:58:df:cf:6d:db",
                        "default": true,
                        "dns": {}
                    }]
                  openshift.io/scc: restricted-v2
                  scheduler.alpha.kubernetes.io/critical-pod:
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Running
IP:               fd01:0:0:1::3c
IPs:
  IP:  fd01:0:0:1::3c
Controlled By:  ReplicaSet/managedcluster-import-controller-v2-6d4bdb4d8
Containers:
  managedcluster-import-controller:
    Container ID:   cri-o://816e98a3aa209f91b5394e7ba8099deca3cee373650ace0a0f69d9d3eb4d266e
    Image:          e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/managedcluster-import-controller-rhel8@sha256:4868d67485b6392985a495ae3fc177e4b090fb252fdd6dffc40e353aa0db126d
    Image ID:       e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/managedcluster-import-controller-rhel8@sha256:4868d67485b6392985a495ae3fc177e4b090fb252fdd6dffc40e353aa0db126d
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Tue, 14 Feb 2023 14:51:35 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 14 Feb 2023 14:42:27 +0000
      Finished:     Tue, 14 Feb 2023 14:51:08 +0000
    Ready:          True
    Restart Count:  82
    Limits:
      cpu:     500m
      memory:  2Gi
    Requests:
      cpu:     50m
      memory:  96Mi
    Environment:
      WATCH_NAMESPACE:
      POD_NAME:                     managedcluster-import-controller-v2-6d4bdb4d8-4xm4r (v1:metadata.name)
      MAX_CONCURRENT_RECONCILES:    10
      OPERATOR_NAME:                managedcluster-import-controller
      DEFAULT_IMAGE_PULL_SECRET:    multiclusterhub-operator-pull-secret
      DEFAULT_IMAGE_REGISTRY:
      REGISTRATION_OPERATOR_IMAGE:  e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/registration-operator-rhel8@sha256:85dc5defbf986e36842dfa8cbf3ff764c1a4636779eb6d0bb553bc52263f6867
      REGISTRATION_IMAGE:           e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/registration-rhel8@sha256:112c8f9f6c237dace9f2137525256ac034b8fd7a853ccd00de3963c84799fa3b
      WORK_IMAGE:                   e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/work-rhel8@sha256:2a1efeda7ce5b3a974078b31b603f157b087d7fc8408e40e4c6e2c03efcf4530
      POD_NAMESPACE:                multicluster-engine (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9rkdh (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-9rkdh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/infra:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Normal   Started  42m (x75 over 12h)    kubelet  Started container managedcluster-import-controller
  Normal   Pulled   15m (x80 over 12h)    kubelet  Container image "e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/managedcluster-import-controller-rhel8@sha256:4868d67485b6392985a495ae3fc177e4b090fb252fdd6dffc40e353aa0db126d" already present on machine
  Normal   Created  15m (x80 over 12h)    kubelet  Created container managedcluster-import-controller
  Warning  BackOff  3m40s (x99 over 11h)  kubelet  Back-off restarting failed container
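To confirm that both replicas are cycling on OOMKilled rather than some other failure, the last terminated state of each pod can be pulled with a jsonpath query:
# oc get po -n multicluster-engine -l app=managedcluster-import-controller-v2 -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'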
The attached test-result graph shows a ceiling being hit before all provisioned clusters became managed (the green managed line never reaches the red sno_install_completed line).