Issue Type: Bug
Resolution: Done
Affects Version: ACM 2.6.4
Description of problem:
While deploying ~2500 SNOs with ACM 2.6.4 for ACM upgrade testing, the ocm-webhook containers are repeatedly being OOM-killed. Most likely the same fix as in https://issues.redhat.com/browse/ACM-2305 just needs to be backported to 2.6.
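For reference, assuming the ACM-2305 fix amounts to raising the container memory limit (the exact value and mechanism in that change may differ), a stop-gap on a live hub would look roughly like the patch below. The 512Mi value is an assumption for illustration only, and the multicluster-engine operator will likely reconcile a manual edit back, so this only buys time until the backport lands:

# oc patch deploy ocm-webhook -n multicluster-engine --type=json -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "512Mi"}]'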
Version-Release number of selected component (if applicable):
ACM - 2.6.4-DOWNSTREAM-2023-01-31-18-35-03
OCP Hub 4.12.1, SNOs 4.12.1
How reproducible:
Steps to Reproduce:
- ...
Actual results:
The ocm-webhook pods are repeatedly OOMKilled and restarted (30 and 14 restarts in ~17h, see below).
Expected results:
The ocm-webhook pods run within their memory limits without being OOMKilled.
Additional info:
# oc get po -n multicluster-engine -l control-plane=ocm-webhook
NAME                           READY   STATUS    RESTARTS        AGE
ocm-webhook-7bf8855796-6dwxb   1/1     Running   30 (91m ago)    17h
ocm-webhook-7bf8855796-vpzvc   1/1     Running   14 (139m ago)   17h

# oc get managedcluster --no-headers | grep -v Unknown | grep True -c
2413

# oc get deploy -n multicluster-engine ocm-webhook -o json | jq '.spec.template.spec.containers[0].resources'
{
  "limits": {
    "memory": "256Mi"
  },
  "requests": {
    "cpu": "50m",
    "memory": "128Mi"
  }
}
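To spot-check the kill reason across both replicas without a full describe, a jsonpath query along these lines works (illustrative; any equivalent query is fine):

# oc get po -n multicluster-engine -l control-plane=ocm-webhook -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'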
A describe on the pods shows the OOMKilled termination:
# oc describe po -n multicluster-engine ocm-webhook-7bf8855796-6dwxb
Name:             ocm-webhook-7bf8855796-6dwxb
Namespace:        multicluster-engine
Priority:         0
Service Account:  ocm-foundation-sa
Node:             e27-h03-000-r650/fc00:1002::6
Start Time:       Mon, 13 Feb 2023 21:34:43 +0000
Labels:           control-plane=ocm-webhook
                  ocm-antiaffinity-selector=ocm-webhook
                  pod-template-hash=7bf8855796
Annotations:      k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["fd01:0:0:3::34/64"],"mac_address":"0a:58:f9:03:bf:bf","gateway_ips":["fd01:0:0:3::1"],"ip_address":"fd01:0:0:...
                  k8s.v1.cni.cncf.io/network-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "fd01:0:0:3::34" ], "mac": "0a:58:f9:03:bf:bf", "default": true, "dns": {} }]
                  k8s.v1.cni.cncf.io/networks-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "fd01:0:0:3::34" ], "mac": "0a:58:f9:03:bf:bf", "default": true, "dns": {} }]
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Running
IP:               fd01:0:0:3::34
IPs:
  IP:  fd01:0:0:3::34
Controlled By:  ReplicaSet/ocm-webhook-7bf8855796
Containers:
  ocm-webhook:
    Container ID:  cri-o://c59972bfa1bfc5da65e62bd7131bef045c0e0fcca5507b4734e1a960487f0d1a
    Image:         e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-manager-rhel8@sha256:59d254159c3292763594d496f36c96d995b8a86367ceda1637db49b54fc95a4d
    Image ID:      e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-manager-rhel8@sha256:59d254159c3292763594d496f36c96d995b8a86367ceda1637db49b54fc95a4d
    Port:          8000/TCP
    Host Port:     0/TCP
    Args:
      /webhook
      --tls-cert-file=/var/run/ocm-webhook/tls.crt
      --tls-private-key-file=/var/run/ocm-webhook/tls.key
    State:          Running
      Started:      Tue, 14 Feb 2023 13:18:23 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 14 Feb 2023 13:12:16 +0000
      Finished:     Tue, 14 Feb 2023 13:13:14 +0000
    Ready:          True
    Restart Count:  30
    Limits:
      memory:  256Mi
    Requests:
      cpu:     50m
      memory:  128Mi
    Liveness:   exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
    Readiness:  exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/ocm-webhook from webhook-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-npwdj (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  webhook-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ocm-webhook
    Optional:    false
  kube-api-access-npwdj:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/infra:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                  From     Message
  ----     ------     ----                 ----     -------
  Normal   Pulled     142m (x19 over 12h)  kubelet  Container image "e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-manager-rhel8@sha256:59d254159c3292763594d496f36c96d995b8a86367ceda1637db49b54fc95a4d" already present on machine
  Normal   Created    142m (x20 over 17h)  kubelet  Created container ocm-webhook
  Normal   Started    142m (x20 over 17h)  kubelet  Started container ocm-webhook
  Warning  Unhealthy  138m (x18 over 12h)  kubelet  Readiness probe failed:
  Warning  Unhealthy  118m (x23 over 12h)  kubelet  Liveness probe failed:
  Warning  BackOff    93m (x291 over 12h)  kubelet  Back-off restarting failed container
It is unclear whether these OOMs are causing any fatal issues in the environment at the current time; however, in the ACM 2.7 testing they were a ceiling for provisioning more clusters. (For the purposes of this testing we have not breached that ceiling yet.)
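Assuming pod metrics are available on the hub, watching how close the replicas sit to the 256Mi limit while provisioning continues should show when that ceiling is being approached (illustrative):

# oc adm top pod -n multicluster-engine -l control-plane=ocm-webhook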