-
Bug
-
Resolution: Done
-
Undefined
-
ACM 2.10.0, ACM 2.9.0
-
None
Description of problem:
While deploying 3500+ SNO's with the du profile being applied and Ansible Automation Platform running a day2 playbook when clusters become labeled ztp-done=, the multicluster-operators-hub-subscription pod began OOM crashlooping which prevented any new clusters from having the ansiblejob run against them. This ceiling was hit at ~1978 clusters achieving the playbook (labeled ztp-ansible=Completed) but with 3000 clusters initialized for deployment. A complete graph will be available at the end of the test itself.
# oc get po -n open-cluster-management multicluster-operators-hub-subscription-6d6d74ffc9-tt8pj NAME READY STATUS RESTARTS AGE multicluster-operators-hub-subscription-6d6d74ffc9-tt8pj 0/1 CrashLoopBackOff 26 (3m58s ago) 3h27m
Version-Release number of selected component (if applicable):
Hub OCP - 4.14.2
Deployed SNOs - 4.14.2
ACM - 2.9.0-DOWNSTREAM-2023-11-09-22-25-56
AAP - aap-operator.v2.4.0-0.1698896316
How reproducible:
Steps to Reproduce:
- ...
Actual results:
Expected results:
Additional info:
Describe of crashlooping pod:
# oc describe po -n open-cluster-management multicluster-operators-hub-subscription-6d6d74ffc9-tt8pj
Name: multicluster-operators-hub-subscription-6d6d74ffc9-tt8pj
Namespace: open-cluster-management
Priority: 0
Service Account: multicluster-operators
Node: e27-h02-000-r650/fc00:1004::5
Start Time: Wed, 15 Nov 2023 21:09:19 +0000
Labels: app=multicluster-operators-hub-subscription
ocm-antiaffinity-selector=multicluster-operators-hub-subscription
pod-template-hash=6d6d74ffc9
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["fd01:0:0:1::b63/64"],"mac_address":"0a:58:80:97:e1:29","gateway_ips":["fd01:0:0:1::1"],"routes":[{"dest":"fd0...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"fd01:0:0:1::b63"
],
"mac": "0a:58:80:97:e1:29",
"default": true,
"dns": {}
}]
openshift.io/scc: restricted-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Running
SeccompProfile: RuntimeDefault
IP: fd01:0:0:1::b63
IPs:
IP: fd01:0:0:1::b63
Controlled By: ReplicaSet/multicluster-operators-hub-subscription-6d6d74ffc9
Containers:
multicluster-operators-hub-subscription:
Container ID: cri-o://48140ea5c494f5235724d89db1047b465a5c4914966253f8a771feac6c617aae
Image: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-subscription-rhel8@sha256:53745951f4ec1f22764dcdd5e23284989fb30c8ab093d0d706d69a41c5892c7d
Image ID: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-subscription-rhel8@sha256:0ccc316710f31a9ec9feb50db129819999f4967db2cdaed2d21b495173f4ecbd
Port: 8443/TCP
Host Port: 0/TCP
Command:
/usr/local/bin/multicluster-operators-subscription
--sync-interval=60
--leader-election-lease-duration=137s
--leader-election-renew-deadline=107s
--leader-election-retry-period=26s
State: Running
Started: Thu, 16 Nov 2023 00:37:36 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Thu, 16 Nov 2023 00:29:00 +0000
Finished: Thu, 16 Nov 2023 00:32:29 +0000
Ready: True
Restart Count: 27
Limits:
cpu: 750m
memory: 2Gi
Requests:
cpu: 150m
memory: 128Mi
Liveness: exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
Readiness: exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
Environment:
WATCH_NAMESPACE:
POD_NAME: multicluster-operators-hub-subscription-6d6d74ffc9-tt8pj (v1:metadata.name)
POD_NAMESPACE: open-cluster-management (v1:metadata.namespace)
DEPLOYMENT_LABEL: multicluster-operators-hub-subscription
OPERATOR_NAME: multicluster-operators-hub-subscription
Mounts:
/etc/subscription from multicluster-operators-subscription-tls (ro)
/tmp from tmp (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lk57f (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
tmp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
multicluster-operators-subscription-tls:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
kube-api-access-lk57f:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/infra:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 103m kubelet Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of 480d876f97f4c1a8e76a8e7f90146ed23222c486a883d8bf38f44426a9b5793b is running failed: container process not found
Normal Pulled 81m (x19 over 3h31m) kubelet Container image "e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-subscription-rhel8@sha256:53745951f4ec1f22764dcdd5e23284989fb30c8ab093d0d706d69a41c5892c7d" already present on machine
Warning BackOff 6m34s (x527 over 3h24m) kubelet Back-off restarting failed container multicluster-operators-hub-subscription in pod multicluster-operators-hub-subscription-6d6d74ffc9-tt8pj_open-cluster-management(59313c70-cd53-4a0c-98b7-daf526b94ac4)
- depends on
-
ACM-9030 Ansible integration performance enhancement in large scale env
-
- Closed
-