Loading...

Type: Bug
Resolution: Done
Priority: Undefined
Fix Version/s: ACM 2.10.0
Affects Version/s: ACM 2.10.0, ACM 2.9.0
Component/s: Application Lifecycle
Labels:
None

Blocked:
False
Blocked Reason:

Hide

None

Show
None
Ready:
False
Epic Link:
Ansible integration performance enhancement in large scale env
Intelligence Requested:
Market:

Severity:
Moderate

Regression:
No

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Description of problem:

While deploying 3500+ SNO's with the du profile being applied and Ansible Automation Platform running a day2 playbook when clusters become labeled ztp-done=, the multicluster-operators-hub-subscription pod began OOM crashlooping which prevented any new clusters from having the ansiblejob run against them. This ceiling was hit at ~1978 clusters achieving the playbook (labeled ztp-ansible=Completed) but with 3000 clusters initialized for deployment. A complete graph will be available at the end of the test itself.

# oc get po -n open-cluster-management multicluster-operators-hub-subscription-6d6d74ffc9-tt8pj                                                     
NAME                                                       READY   STATUS             RESTARTS         AGE                                                                                   
multicluster-operators-hub-subscription-6d6d74ffc9-tt8pj   0/1     CrashLoopBackOff   26 (3m58s ago)   3h27m

Version-Release number of selected component (if applicable):

Hub OCP - 4.14.2

Deployed SNOs - 4.14.2

ACM - 2.9.0-DOWNSTREAM-2023-11-09-22-25-56

AAP - aap-operator.v2.4.0-0.1698896316

How reproducible:

Steps to Reproduce:

...

Actual results:

Expected results:

Additional info:

Describe of crashlooping pod:

# oc describe po -n open-cluster-management multicluster-operators-hub-subscription-6d6d74ffc9-tt8pj
Name:             multicluster-operators-hub-subscription-6d6d74ffc9-tt8pj
Namespace:        open-cluster-management
Priority:         0
Service Account:  multicluster-operators
Node:             e27-h02-000-r650/fc00:1004::5
Start Time:       Wed, 15 Nov 2023 21:09:19 +0000
Labels:           app=multicluster-operators-hub-subscription
                  ocm-antiaffinity-selector=multicluster-operators-hub-subscription
                  pod-template-hash=6d6d74ffc9
Annotations:      k8s.ovn.org/pod-networks:
                    {"default":{"ip_addresses":["fd01:0:0:1::b63/64"],"mac_address":"0a:58:80:97:e1:29","gateway_ips":["fd01:0:0:1::1"],"routes":[{"dest":"fd0...
                  k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "ovn-kubernetes",
                        "interface": "eth0",
                        "ips": [
                            "fd01:0:0:1::b63"
                        ],
                        "mac": "0a:58:80:97:e1:29",
                        "default": true,
                        "dns": {}
                    }]
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Running
SeccompProfile:   RuntimeDefault
IP:               fd01:0:0:1::b63
IPs:
  IP:           fd01:0:0:1::b63
Controlled By:  ReplicaSet/multicluster-operators-hub-subscription-6d6d74ffc9
Containers:
  multicluster-operators-hub-subscription:
    Container ID:  cri-o://48140ea5c494f5235724d89db1047b465a5c4914966253f8a771feac6c617aae
    Image:         e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-subscription-rhel8@sha256:53745951f4ec1f22764dcdd5e23284989fb30c8ab093d0d706d69a41c5892c7d
    Image ID:      e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-subscription-rhel8@sha256:0ccc316710f31a9ec9feb50db129819999f4967db2cdaed2d21b495173f4ecbd
    Port:          8443/TCP
    Host Port:     0/TCP
    Command:
      /usr/local/bin/multicluster-operators-subscription
      --sync-interval=60
      --leader-election-lease-duration=137s
      --leader-election-renew-deadline=107s
      --leader-election-retry-period=26s
    State:          Running
      Started:      Thu, 16 Nov 2023 00:37:36 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 16 Nov 2023 00:29:00 +0000
      Finished:     Thu, 16 Nov 2023 00:32:29 +0000
    Ready:          True
    Restart Count:  27
    Limits:
      cpu:     750m
      memory:  2Gi
    Requests:
      cpu:      150m
      memory:   128Mi
    Liveness:   exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
    Readiness:  exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
    Environment:
      WATCH_NAMESPACE:
      POD_NAME:          multicluster-operators-hub-subscription-6d6d74ffc9-tt8pj (v1:metadata.name)
      POD_NAMESPACE:     open-cluster-management (v1:metadata.namespace)
      DEPLOYMENT_LABEL:  multicluster-operators-hub-subscription
      OPERATOR_NAME:     multicluster-operators-hub-subscription
    Mounts:
      /etc/subscription from multicluster-operators-subscription-tls (ro)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lk57f (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  multicluster-operators-subscription-tls:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-lk57f:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/infra:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                      From     Message
  ----     ------     ----                     ----     -------
  Warning  Unhealthy  103m                     kubelet  Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of 480d876f97f4c1a8e76a8e7f90146ed23222c486a883d8bf38f44426a9b5793b is running failed: container process not found
  Normal   Pulled     81m (x19 over 3h31m)     kubelet  Container image "e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-subscription-rhel8@sha256:53745951f4ec1f22764dcdd5e23284989fb30c8ab093d0d706d69a41c5892c7d" already present on machine   
  Warning  BackOff    6m34s (x527 over 3h24m)  kubelet  Back-off restarting failed container multicluster-operators-hub-subscription in pod multicluster-operators-hub-subscription-6d6d74ffc9-tt8pj_open-cluster-management(59313c70-cd53-4a0c-98b7-daf526b94ac4)

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

playbook-share-20231116-050625.png
2023/11/16 4:44 PM
123 kB
Alex Krzos

depends on

ACM-9030 Ansible integration performance enhancement in large scale env

Closed

Details

Description

Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

Attachments

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates