Bug
Resolution: Done
Affects Version: ACM 2.6.4
Description of problem:
While deploying ~2500 SNOs on an ACM 2.6.4 hub for ACM upgrade testing, the ocm-webhook containers are being OOM-killed. Most likely we just need the fix from https://issues.redhat.com/browse/ACM-2305 backported to 2.6.
Version-Release number of selected component (if applicable):
ACM - 2.6.4-DOWNSTREAM-2023-01-31-18-35-03
OCP Hub 4.12.1, SNOs 4.12.1
How reproducible:
Recurring; both ocm-webhook replicas have restarted repeatedly (30 and 14 times in ~17h, see below).
Steps to Reproduce:
- Deploy an ACM 2.6.4 hub on OCP 4.12.1.
- Provision ~2500 SNO managed clusters (4.12.1).
- Watch the ocm-webhook pods in the multicluster-engine namespace.
Actual results:
The ocm-webhook containers are repeatedly OOMKilled (exit code 137) and restarted.
Expected results:
ocm-webhook runs within its memory limit at this managed cluster count.
Additional info:
# oc get po -n multicluster-engine -l control-plane=ocm-webhook
NAME                           READY   STATUS    RESTARTS       AGE
ocm-webhook-7bf8855796-6dwxb   1/1     Running   30 (91m ago)   17h
ocm-webhook-7bf8855796-vpzvc   1/1     Running   14 (139m ago)  17h
# oc get managedcluster --no-headers | grep -v Unknown | grep True -c
2413
# oc get deploy -n multicluster-engine ocm-webhook -o json | jq '.spec.template.spec.containers[0].resources'
{
"limits": {
"memory": "256Mi"
},
"requests": {
"cpu": "50m",
"memory": "128Mi"
}
}
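For a quick confirmation that the restarts are OOM kills (without a full describe), the last terminated state of each replica can be queried directly. This jsonpath query is a sketch, not output captured from this environment:
# oc get po -n multicluster-engine -l control-plane=ocm-webhook \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'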
A describe on the pods shows the OOM kills:
# oc describe po -n multicluster-engine ocm-webhook-7bf8855796-6dwxb
Name: ocm-webhook-7bf8855796-6dwxb
Namespace: multicluster-engine
Priority: 0
Service Account: ocm-foundation-sa
Node: e27-h03-000-r650/fc00:1002::6
Start Time: Mon, 13 Feb 2023 21:34:43 +0000
Labels: control-plane=ocm-webhook
ocm-antiaffinity-selector=ocm-webhook
pod-template-hash=7bf8855796
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["fd01:0:0:3::34/64"],"mac_address":"0a:58:f9:03:bf:bf","gateway_ips":["fd01:0:0:3::1"],"ip_address":"fd01:0:0:...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"fd01:0:0:3::34"
],
"mac": "0a:58:f9:03:bf:bf",
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"fd01:0:0:3::34"
],
"mac": "0a:58:f9:03:bf:bf",
"default": true,
"dns": {}
}]
openshift.io/scc: restricted-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Running
IP: fd01:0:0:3::34
IPs:
IP: fd01:0:0:3::34
Controlled By: ReplicaSet/ocm-webhook-7bf8855796
Containers:
ocm-webhook:
Container ID: cri-o://c59972bfa1bfc5da65e62bd7131bef045c0e0fcca5507b4734e1a960487f0d1a
Image: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-manager-rhel8@sha256:59d254159c3292763594d496f36c96d995b8a86367ceda1637db49b54fc95a4d
Image ID: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-manager-rhel8@sha256:59d254159c3292763594d496f36c96d995b8a86367ceda1637db49b54fc95a4d
Port: 8000/TCP
Host Port: 0/TCP
Args:
/webhook
--tls-cert-file=/var/run/ocm-webhook/tls.crt
--tls-private-key-file=/var/run/ocm-webhook/tls.key
State: Running
Started: Tue, 14 Feb 2023 13:18:23 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Tue, 14 Feb 2023 13:12:16 +0000
Finished: Tue, 14 Feb 2023 13:13:14 +0000
Ready: True
Restart Count: 30
Limits:
memory: 256Mi
Requests:
cpu: 50m
memory: 128Mi
Liveness: exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
Readiness: exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/ocm-webhook from webhook-cert (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-npwdj (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
webhook-cert:
Type: Secret (a volume populated by a Secret)
SecretName: ocm-webhook
Optional: false
kube-api-access-npwdj:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/infra:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 142m (x19 over 12h) kubelet Container image "e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-manager-rhel8@sha256:59d254159c3292763594d496f36c96d995b8a86367ceda1637db49b54fc95a4d" already present on machine
Normal Created 142m (x20 over 17h) kubelet Created container ocm-webhook
Normal Started 142m (x20 over 17h) kubelet Started container ocm-webhook
Warning Unhealthy 138m (x18 over 12h) kubelet Readiness probe failed:
Warning Unhealthy 118m (x23 over 12h) kubelet Liveness probe failed:
Warning BackOff 93m (x291 over 12h) kubelet Back-off restarting failed container
It is unclear whether these OOMs are causing any fatal issues in the environment at the current time; however, in the ACM 2.7 testing this was a ceiling for provisioning more clusters. (For the purposes of this testing we have not breached that ceiling yet.)
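As an interim workaround until the backport lands (assuming the ACM-2305 fix is a raised memory limit; 512Mi here is an illustrative value, not a validated one), the limit on the deployment could be patched directly. Note that the MCE operator may reconcile this back to the default:
# oc patch deploy -n multicluster-engine ocm-webhook --type=json \
    -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "512Mi"}]'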