Bug
Resolution: Done
Major
ACM 2.9.0
False
False
Important
No
Description of problem:
The multicluster-operators-gitopscluster container in the multicluster-operators-application pod is repeatedly OOM-killed and crash-looping on a large-scale hub that has deployed and manages 3500+ SNOs.
# oc get po -n open-cluster-management multicluster-operators-application-fbc4696f6-7c5fj
NAME READY STATUS RESTARTS AGE
multicluster-operators-application-fbc4696f6-7c5fj 3/3 Running 883 (4m48s ago) 4d22h
Count of managed clusters (not all are available, since some deployed clusters failed to install):
# oc get managedcluster -A --no-headers | wc -l
3619
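For a rough sense of scale, the 1Gi memory limit on the gitopscluster container (shown in the pod description under Additional info) can be divided by the managed cluster count above. This is only back-of-the-envelope arithmetic. It assumes memory use grows roughly linearly with the number of managed clusters, which this report does not confirm, and the function name is made up for illustration.

```python
# Rough per-cluster memory budget under the container's memory limit.
# Assumption (not confirmed by this report): memory grows linearly with
# the number of managed clusters.
MEMORY_LIMIT_BYTES = 1 * 1024**3  # 1Gi limit from the pod description
MANAGED_CLUSTERS = 3619           # from `oc get managedcluster ... | wc -l`


def per_cluster_budget_kib(limit_bytes: int, clusters: int) -> float:
    """Return the average memory budget per managed cluster, in KiB."""
    return limit_bytes / clusters / 1024


budget = per_cluster_budget_kib(MEMORY_LIMIT_BYTES, MANAGED_CLUSTERS)
print(f"{budget:.0f} KiB per managed cluster")
```

Under that assumption the container has roughly 290 KiB of headroom per managed cluster before it reaches its limit.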
Version-Release number of selected component (if applicable):
ACM - 2.9.0-DOWNSTREAM-2023-09-01-02-58-15
Hub cluster is OCP 4.13.10
Deployed SNOs were originally 4.12.29 and were later upgraded to 4.13.9 (the OOM crash-looping started before the upgrade)
How reproducible:
Steps to Reproduce:
- ...
Actual results:
Expected results:
Additional info:
Pod description:
# oc describe po -n open-cluster-management multicluster-operators-application-fbc4696f6-7c5fj
Name: multicluster-operators-application-fbc4696f6-7c5fj
Namespace: open-cluster-management
Priority: 0
Service Account: multicluster-applications
Node: e27-h05-000-r650/fc00:1004::7
Start Time: Fri, 01 Sep 2023 21:03:24 +0000
Labels: app=multicluster-operators-application
ocm-antiaffinity-selector=multicluster-operators-application
pod-template-hash=fbc4696f6
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["fd01:0:0:1::42/64"],"mac_address":"0a:58:13:d6:9f:ba","gateway_ips":["fd01:0:0:1::1"],"ip_address":"fd01:0:0:...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"fd01:0:0:1::42"
],
"mac": "0a:58:13:d6:9f:ba",
"default": true,
"dns": {}
}]
openshift.io/scc: restricted-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Running
IP: fd01:0:0:1::42
IPs:
IP: fd01:0:0:1::42
Controlled By: ReplicaSet/multicluster-operators-application-fbc4696f6
Containers:
multicluster-operators-placementrule:
Container ID: cri-o://776f4936cff86a713c889ca10d1a66b4815cdb25303cbe644564a5479cf7bb83
Image: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-subscription-rhel8@sha256:a92716fa2c4798ceb9a1cc35a84f57874a703c2ba36b8b630603f4c987078a58
Image ID: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-subscription-rhel8@sha256:2b39642174254dc51a044f55c9ca234003c62bb855f5b72eca6c11a5cd845e7d
Port: <none>
Host Port: <none>
Command:
/usr/local/bin/multicluster-operators-placementrule
--alsologtostderr
--leader-election-lease-duration=137s
--leader-election-renew-deadline=107s
--leader-election-retry-period=26s
State: Running
Started: Fri, 01 Sep 2023 21:04:48 +0000
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Fri, 01 Sep 2023 21:03:35 +0000
Finished: Fri, 01 Sep 2023 21:04:45 +0000
Ready: True
Restart Count: 1
Limits:
cpu: 1500m
memory: 1536Mi
Requests:
cpu: 300m
memory: 64Mi
Liveness: exec [ls] delay=30s timeout=1s period=30s #success=1 #failure=3
Readiness: exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
Environment:
WATCH_NAMESPACE:
POD_NAME: multicluster-operators-application-fbc4696f6-7c5fj (v1:metadata.name)
POD_NAMESPACE: open-cluster-management (v1:metadata.namespace)
DEPLOYMENT_LABEL: multicluster-operators-placementrule
OPERATOR_NAME: multicluster-operators-application
Mounts:
/tmp from tmp (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7bxtz (ro)
multicluster-operators-gitopscluster:
Container ID: cri-o://7b44880d1ac37e1c33bf7df783b010271e6f1f4a1622bda4de4bde407b393250
Image: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-integrations-rhel8@sha256:80476ac2be7f80532025ac426dcde05e196b8ad6d57737e7a2febf4ab16214e4
Image ID: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-integrations-rhel8@sha256:20926f7ace2b880a0fc90251aa98574147737262a3ca6a1022533b0580a9314f
Port: <none>
Host Port: <none>
Command:
/usr/local/bin/gitopscluster
--alsologtostderr
--leader-election-lease-duration=137s
--leader-election-renew-deadline=107s
--leader-election-retry-period=26s
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Wed, 06 Sep 2023 19:52:45 +0000
Finished: Wed, 06 Sep 2023 19:57:29 +0000
Ready: False
Restart Count: 882
Limits:
cpu: 100m
memory: 1Gi
Requests:
cpu: 25m
memory: 64Mi
Liveness: exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
Readiness: exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
Environment:
WATCH_NAMESPACE:
POD_NAME: multicluster-operators-application-fbc4696f6-7c5fj (v1:metadata.name)
POD_NAMESPACE: open-cluster-management (v1:metadata.namespace)
DEPLOYMENT_LABEL: multicluster-operators-gitopscluster
OPERATOR_NAME: multicluster-operators-application
Mounts:
/tmp from tmp (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7bxtz (ro)
multicluster-operators-application:
Container ID: cri-o://b7c2c2034019dcab334f7a8886180e2f60a1fc73eedab7b18167bae0d44efb35
Image: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-application-rhel8@sha256:ca0947ce525caa24d34e7a17dc16dedecb1c07b3ad300bea7377bda62cbda4a5
Image ID: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-application-rhel8@sha256:8780b674a6ea8081e5213d1257543b65bb3b177d89ac166148db6434d575f419
Port: 9442/TCP
Host Port: 0/TCP
Command:
/usr/local/bin/multicluster-operators-application
--alsologtostderr
--leader-election-lease-duration=137s
--leader-election-renew-deadline=107s
--leader-election-retry-period=26s
State: Running
Started: Fri, 01 Sep 2023 21:03:36 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 100m
memory: 512Mi
Requests:
cpu: 25m
memory: 64Mi
Liveness: exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
Readiness: exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
Environment:
WATCH_NAMESPACE:
POD_NAME: multicluster-operators-application-fbc4696f6-7c5fj (v1:metadata.name)
POD_NAMESPACE: open-cluster-management (v1:metadata.namespace)
DEPLOYMENT_LABEL: multicluster-operators-application
OPERATOR_NAME: multicluster-operators-application
Mounts:
/tmp from tmp (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7bxtz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
tmp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-7bxtz:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/infra:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 53m kubelet Liveness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of ec3c13e66a67c4d13c99b1ee843b3e29904a0389292afbb5842f31e214c74876 is running failed: container process not found
Warning Unhealthy 53m kubelet Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of ec3c13e66a67c4d13c99b1ee843b3e29904a0389292afbb5842f31e214c74876 is running failed: container process not found
Warning BackOff 36s (x13165 over 4d16h) kubelet Back-off restarting failed container multicluster-operators-gitopscluster in pod multicluster-operators-application-fbc4696f6-7c5fj_open-cluster-management(a4d99811-c098-4a94-9731-d695dabae4e9)
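The timestamps and counters in the pod description above are enough to estimate how often the gitopscluster container is being killed. The sketch below is illustrative arithmetic only. It averages over the whole pod lifetime, so the figure includes CrashLoopBackOff delays, and it decodes exit code 137 using the usual 128 + signal-number convention. The variable names are chosen for this example.

```python
from datetime import datetime

# Timestamp format used by `oc describe`, e.g. "Fri, 01 Sep 2023 21:03:24 +0000".
FMT = "%a, %d %b %Y %H:%M:%S %z"

# Values copied from the pod description above.
pod_start = datetime.strptime("Fri, 01 Sep 2023 21:03:24 +0000", FMT)
last_started = datetime.strptime("Wed, 06 Sep 2023 19:52:45 +0000", FMT)
last_finished = datetime.strptime("Wed, 06 Sep 2023 19:57:29 +0000", FMT)
restart_count = 882  # gitopscluster container restart count
exit_code = 137      # last terminated state

# How long the most recent run survived before it was OOM-killed.
last_run_seconds = (last_finished - last_started).total_seconds()

# Average time per crash cycle over the pod lifetime (includes back-off waits).
avg_cycle_seconds = (last_finished - pod_start).total_seconds() / restart_count

# Exit codes above 128 conventionally mean "killed by signal (code - 128)".
signal_number = exit_code - 128  # 9 is SIGKILL on Linux

print(f"last run lasted {last_run_seconds:.0f}s")
print(f"average crash cycle: {avg_cycle_seconds / 60:.1f} min")
print(f"terminated by signal {signal_number}")
```

With these inputs the last run lasted under five minutes and the average cycle is about eight minutes. That is consistent with the container hitting its 1Gi limit soon after each start rather than leaking slowly.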