Red Hat Advanced Cluster Management / ACM-7367

multicluster-operators-gitopscluster container OOMing while managing 3500+ SNOs


      Description of problem:

      The multicluster-operators-gitopscluster container in the multicluster-operators-application pod is repeatedly OOMKilled and crash-looping on a large-scale hub that has deployed and is managing 3500+ single-node OpenShift (SNO) clusters.

      # oc get po -n open-cluster-management multicluster-operators-application-fbc4696f6-7c5fj
      NAME                                                 READY   STATUS    RESTARTS          AGE
      multicluster-operators-application-fbc4696f6-7c5fj   3/3     Running   883 (4m48s ago)   4d22h
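
      The pod-level RESTARTS column does not show which container is restarting or why. A minimal sketch for breaking that down per container, assuming a shell with `oc` logged in to the hub; the function name `container_restarts` and the `OC` override variable are hypothetical and only there so the client binary can be swapped (for example for `kubectl`):

```shell
#!/usr/bin/env bash
# Sketch: print name, restart count, and last termination reason for each
# container in a pod, one container per line, tab-separated.
# OC is a hypothetical override for the client command; it defaults to oc.
container_restarts() {
  local namespace="$1" pod="$2"
  "${OC:-oc}" get pod "$pod" -n "$namespace" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Example call (needs access to the hub cluster):
#   container_restarts open-cluster-management multicluster-operators-application-fbc4696f6-7c5fj
```

      The last column is empty for a container that has never terminated.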

      Count of managed clusters (not all are available, because some deployed clusters failed to install):

      # oc get managedcluster -A --no-headers | wc -l
      3619

      Version-Release number of selected component (if applicable):

      ACM - 2.9.0-DOWNSTREAM-2023-09-01-02-58-15

      Hub cluster is OCP 4.13.10

      The deployed SNOs were originally at 4.12.29 and were later upgraded to 4.13.9. The OOM crash-looping started before the upgrade.

      How reproducible:

      Steps to Reproduce:

      1.  
      2.  
      3. ...

      Actual results:

      Expected results:

      Additional info:

      Pod description:

      # oc describe po -n open-cluster-management multicluster-operators-application-fbc4696f6-7c5fj
      Name:             multicluster-operators-application-fbc4696f6-7c5fj
      Namespace:        open-cluster-management
      Priority:         0
      Service Account:  multicluster-applications
      Node:             e27-h05-000-r650/fc00:1004::7
      Start Time:       Fri, 01 Sep 2023 21:03:24 +0000
      Labels:           app=multicluster-operators-application
                        ocm-antiaffinity-selector=multicluster-operators-application
                        pod-template-hash=fbc4696f6
      Annotations:      k8s.ovn.org/pod-networks:
                          {"default":{"ip_addresses":["fd01:0:0:1::42/64"],"mac_address":"0a:58:13:d6:9f:ba","gateway_ips":["fd01:0:0:1::1"],"ip_address":"fd01:0:0:...
                        k8s.v1.cni.cncf.io/network-status:
                          [{
                              "name": "ovn-kubernetes",
                              "interface": "eth0",
                              "ips": [
                                  "fd01:0:0:1::42"
                              ],
                              "mac": "0a:58:13:d6:9f:ba",
                              "default": true,
                              "dns": {}
                          }]
                        openshift.io/scc: restricted-v2
                        seccomp.security.alpha.kubernetes.io/pod: runtime/default
      Status:           Running
      IP:               fd01:0:0:1::42
      IPs:
        IP:           fd01:0:0:1::42
      Controlled By:  ReplicaSet/multicluster-operators-application-fbc4696f6
      Containers:
        multicluster-operators-placementrule:
          Container ID:  cri-o://776f4936cff86a713c889ca10d1a66b4815cdb25303cbe644564a5479cf7bb83
          Image:         e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-subscription-rhel8@sha256:a92716fa2c4798ceb9a1cc35a84f57874a703c2ba36b8b630603f4c987078a58
          Image ID:      e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-subscription-rhel8@sha256:2b39642174254dc51a044f55c9ca234003c62bb855f5b72eca6c11a5cd845e7d
          Port:          <none>
          Host Port:     <none>
          Command:
            /usr/local/bin/multicluster-operators-placementrule
            --alsologtostderr
            --leader-election-lease-duration=137s
            --leader-election-renew-deadline=107s
            --leader-election-retry-period=26s
          State:          Running
            Started:      Fri, 01 Sep 2023 21:04:48 +0000
          Last State:     Terminated
            Reason:       Error
            Exit Code:    1
            Started:      Fri, 01 Sep 2023 21:03:35 +0000
            Finished:     Fri, 01 Sep 2023 21:04:45 +0000
          Ready:          True
          Restart Count:  1
          Limits:
            cpu:     1500m
            memory:  1536Mi
          Requests:
            cpu:      300m
            memory:   64Mi
          Liveness:   exec [ls] delay=30s timeout=1s period=30s #success=1 #failure=3
          Readiness:  exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
          Environment:
            WATCH_NAMESPACE:
            POD_NAME:          multicluster-operators-application-fbc4696f6-7c5fj (v1:metadata.name)
            POD_NAMESPACE:     open-cluster-management (v1:metadata.namespace)
            DEPLOYMENT_LABEL:  multicluster-operators-placementrule
            OPERATOR_NAME:     multicluster-operators-application
          Mounts:
            /tmp from tmp (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7bxtz (ro)
        multicluster-operators-gitopscluster:
          Container ID:  cri-o://7b44880d1ac37e1c33bf7df783b010271e6f1f4a1622bda4de4bde407b393250
          Image:         e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-integrations-rhel8@sha256:80476ac2be7f80532025ac426dcde05e196b8ad6d57737e7a2febf4ab16214e4
          Image ID:      e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-integrations-rhel8@sha256:20926f7ace2b880a0fc90251aa98574147737262a3ca6a1022533b0580a9314f
          Port:          <none>
          Host Port:     <none>
          Command:
            /usr/local/bin/gitopscluster
            --alsologtostderr
            --leader-election-lease-duration=137s
            --leader-election-renew-deadline=107s
            --leader-election-retry-period=26s
          State:          Waiting
            Reason:       CrashLoopBackOff
          Last State:     Terminated
            Reason:       OOMKilled
            Exit Code:    137
            Started:      Wed, 06 Sep 2023 19:52:45 +0000
            Finished:     Wed, 06 Sep 2023 19:57:29 +0000
          Ready:          False
          Restart Count:  882
          Limits:
            cpu:     100m
            memory:  1Gi
          Requests:
            cpu:      25m
            memory:   64Mi
          Liveness:   exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
          Readiness:  exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
          Environment:
            WATCH_NAMESPACE:
            POD_NAME:          multicluster-operators-application-fbc4696f6-7c5fj (v1:metadata.name)
            POD_NAMESPACE:     open-cluster-management (v1:metadata.namespace)
            DEPLOYMENT_LABEL:  multicluster-operators-gitopscluster
            OPERATOR_NAME:     multicluster-operators-application
          Mounts:
            /tmp from tmp (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7bxtz (ro)
        multicluster-operators-application:
          Container ID:  cri-o://b7c2c2034019dcab334f7a8886180e2f60a1fc73eedab7b18167bae0d44efb35
          Image:         e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-application-rhel8@sha256:ca0947ce525caa24d34e7a17dc16dedecb1c07b3ad300bea7377bda62cbda4a5
          Image ID:      e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-application-rhel8@sha256:8780b674a6ea8081e5213d1257543b65bb3b177d89ac166148db6434d575f419
          Port:          9442/TCP
          Host Port:     0/TCP
          Command:
            /usr/local/bin/multicluster-operators-application
            --alsologtostderr
            --leader-election-lease-duration=137s
            --leader-election-renew-deadline=107s
            --leader-election-retry-period=26s
          State:          Running
            Started:      Fri, 01 Sep 2023 21:03:36 +0000
          Ready:          True
          Restart Count:  0
          Limits:
            cpu:     100m
            memory:  512Mi
          Requests:
            cpu:      25m
            memory:   64Mi
          Liveness:   exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
          Readiness:  exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
          Environment:
            WATCH_NAMESPACE:
            POD_NAME:          multicluster-operators-application-fbc4696f6-7c5fj (v1:metadata.name)
            POD_NAMESPACE:     open-cluster-management (v1:metadata.namespace)
            DEPLOYMENT_LABEL:  multicluster-operators-application
            OPERATOR_NAME:     multicluster-operators-application
          Mounts:
            /tmp from tmp (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7bxtz (ro)
      Conditions:
        Type              Status
        Initialized       True
        Ready             False
        ContainersReady   False
        PodScheduled      True
      Volumes:
        tmp:
          Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
          Medium:
          SizeLimit:  <unset>
        kube-api-access-7bxtz:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3607
          ConfigMapName:           kube-root-ca.crt
          ConfigMapOptional:       <nil>
          DownwardAPI:             true
          ConfigMapName:           openshift-service-ca.crt
          ConfigMapOptional:       <nil>
      QoS Class:                   Burstable
      Node-Selectors:              <none>
      Tolerations:                 node-role.kubernetes.io/infra:NoSchedule op=Exists
                                   node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                   node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
      Events:
        Type     Reason     Age                      From     Message
        ----     ------     ----                     ----     -------
        Warning  Unhealthy  53m                      kubelet  Liveness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of ec3c13e66a67c4d13c99b1ee843b3e29904a0389292afbb5842f31e214c74876 is running failed: container process not found
        Warning  Unhealthy  53m                      kubelet  Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of ec3c13e66a67c4d13c99b1ee843b3e29904a0389292afbb5842f31e214c74876 is running failed: container process not found
        Warning  BackOff    36s (x13165 over 4d16h)  kubelet  Back-off restarting failed container multicluster-operators-gitopscluster in pod multicluster-operators-application-fbc4696f6-7c5fj_open-cluster-management(a4d99811-c098-4a94-9731-d695dabae4e9)
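
      The timestamps in the pod description above give a rough picture of how quickly the gitopscluster container hits its 1Gi limit. The following is only an illustrative calculation, not part of the original report: it assumes GNU `date`, treats all timestamps as UTC, and uses the pod Start Time as an approximation for when the container first started. The helper name `secs_between` is hypothetical.

```shell
#!/usr/bin/env bash
# Seconds between two UTC timestamps (relies on GNU date's -d option).
secs_between() {
  local start end
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  echo $(( end - start ))
}

# Values copied from the pod description:
#   gitopscluster Last State: Started 19:52:45, Finished 19:57:29 on 06 Sep 2023
#   pod Start Time: 01 Sep 2023 21:03:24, gitopscluster Restart Count: 882
last_run=$(secs_between "2023-09-06 19:52:45" "2023-09-06 19:57:29")
pod_age=$(secs_between "2023-09-01 21:03:24" "2023-09-06 19:57:29")
restarts=882

echo "last run before OOMKilled: ${last_run}s"
echo "average interval per restart: $(( pod_age / restarts ))s"
```

      With these inputs the last run lasted 284 seconds, and the average interval per restart, back-off time included, is about 485 seconds.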
      

       

              phwu@redhat.com Philip Wu
              akrzos@redhat.com Alex Krzos
              David Huynh