Red Hat Advanced Cluster Management / ACM-4768

argocd-pull-integration-controller-manager OOM with 1246 managed clusters


    • Type: Bug
    • Resolution: Unresolved
    • Integration

      Description of problem:

      The argocd-pull-integration-controller-manager container is being OOM-killed and is crash-looping while the hub manages 1246 clusters. The managed clusters are a mix of SNO, compact, and standard types.
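
For triage, the pod status under "Additional info" shows the container terminating with exit code 137. By the usual shell and container-runtime convention, an exit code above 128 means the process was killed by signal (code - 128), so 137 corresponds to signal 9 (SIGKILL), which is what the kernel OOM killer sends. A minimal sketch of that arithmetic follows; `decode_exit_code` is a hypothetical helper written for illustration, not part of `oc` or ACM, and it only names two common signals.

```shell
# Hypothetical helper: decode a container exit code.
# Convention: an exit code of 128 + N means "terminated by signal N".
decode_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    # Only two common signals are named here; others print as "unknown".
    case "$sig" in
      9)  name=SIGKILL ;;
      15) name=SIGTERM ;;
      *)  name=unknown ;;
    esac
    echo "signal $sig ($name)"
  else
    echo "exit status $code"
  fi
}

# Exit code reported for argocd-pull-integration-controller-manager
decode_exit_code 137
```

The same helper applied to the syncresource container's exit code 2 reports an ordinary exit status, not a signal.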

      Version-Release number of selected component (if applicable):

      ACM Hub - 4.12.10

      ACM - 2.8.0-DOWNSTREAM-2023-04-04-01-46-55

      Managed cluster OCP 4.12.10

      How reproducible:

      Steps to Reproduce:


      Actual results:

      Expected results:

      Additional info:

       

      # oc get po -n open-cluster-management multicluster-integrations-75d6547fdf-q7mcv
      NAME                                         READY   STATUS             RESTARTS        AGE
      multicluster-integrations-75d6547fdf-q7mcv   2/3     CrashLoopBackOff   180 (36s ago)   16h
      # oc describe po -n open-cluster-management multicluster-integrations-75d6547fdf-q7mcv
      Name:             multicluster-integrations-75d6547fdf-q7mcv
      Namespace:        open-cluster-management
      Priority:         0
      Service Account:  multicluster-applications
      Node:             e27-h03-000-r650/fc00:1004::6
      Start Time:       Tue, 04 Apr 2023 22:34:14 +0000
      Labels:           name=multicluster-integrations
                        ocm-antiaffinity-selector=multicluster-integrations
                        pod-template-hash=75d6547fdf
      Annotations:      k8s.ovn.org/pod-networks:
                          {"default":{"ip_addresses":["fd01:0:0:3::23/64"],"mac_address":"0a:58:10:b9:61:b7","gateway_ips":["fd01:0:0:3::1"],"ip_address":"fd01:0:0:...
                        k8s.v1.cni.cncf.io/network-status:
                          [{
                              "name": "ovn-kubernetes",
                              "interface": "eth0",
                              "ips": [
                                  "fd01:0:0:3::23"
                              ],
                              "mac": "0a:58:10:b9:61:b7",
                              "default": true,
                              "dns": {}
                          }]
                        k8s.v1.cni.cncf.io/networks-status:
                          [{
                              "name": "ovn-kubernetes",
                              "interface": "eth0",
                              "ips": [
                                  "fd01:0:0:3::23"
                              ],
                              "mac": "0a:58:10:b9:61:b7",
                              "default": true,
                              "dns": {}
                          }]
                        openshift.io/scc: restricted-v2
                        seccomp.security.alpha.kubernetes.io/pod: runtime/default
      Status:           Running
      IP:               fd01:0:0:3::23
      IPs:
        IP:           fd01:0:0:3::23
      Controlled By:  ReplicaSet/multicluster-integrations-75d6547fdf
      Containers:
        argocd-pull-integration-controller-manager:
          Container ID:  cri-o://4813c2b6afcdc4ca547effd30504158853edd162d1b36b4504d83d0eb1b95452
          Image:         e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-integrations-rhel8@sha256:9227443a4a57c432f48301019af17691c9070778210a3f22425bd4d8f85bcc29
          Image ID:      e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-integrations-rhel8@sha256:59c8d4fc46b89117e69a255ea987380c111d551e3086be179ea80fb783a12101
          Port:          <none>
          Host Port:     <none>
          Command:
            /usr/local/bin/propagation
            --leader-election-lease-duration=137
            --renew-deadline=107
            --retry-period=26
          State:          Waiting
            Reason:       CrashLoopBackOff
          Last State:     Terminated
            Reason:       OOMKilled
            Exit Code:    137
            Started:      Wed, 05 Apr 2023 14:39:38 +0000
            Finished:     Wed, 05 Apr 2023 14:39:52 +0000
          Ready:          False
          Restart Count:  171
          Limits:
            cpu:     500m
            memory:  128Mi
          Requests:
            cpu:        10m
            memory:     64Mi
          Liveness:     exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
          Readiness:    exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
          Environment:  <none>
          Mounts:
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vp7gz (ro)
        multicluster-integrations-syncresource:
          Container ID:  cri-o://e28ab4d1ed0c48674527c534393812bb0ca23015db4cf3be9d8fced963201ff5
          Image:         e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-integrations-rhel8@sha256:9227443a4a57c432f48301019af17691c9070778210a3f22425bd4d8f85bcc29
          Image ID:      e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-integrations-rhel8@sha256:59c8d4fc46b89117e69a255ea987380c111d551e3086be179ea80fb783a12101
          Port:          <none>
          Host Port:     <none>
          Command:
            /usr/local/bin/gitopssyncresc
            --appset-resource-dir=/etc/gitops-resources
            --sync-interval=10
            --leader-election-lease-duration=137
            --renew-deadline=107
            --retry-period=26
          State:          Running
            Started:      Wed, 05 Apr 2023 04:59:55 +0000
          Last State:     Terminated
            Reason:       Error
            Exit Code:    2
            Started:      Wed, 05 Apr 2023 03:59:15 +0000
            Finished:     Wed, 05 Apr 2023 04:59:54 +0000
          Ready:          True
          Restart Count:  9
          Limits:
            cpu:     100m
            memory:  512Mi
          Requests:
            cpu:      25m
            memory:   64Mi
          Liveness:   exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
          Readiness:  exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
          Environment:
            WATCH_NAMESPACE:
            POD_NAME:          multicluster-integrations-75d6547fdf-q7mcv (v1:metadata.name)
            POD_NAMESPACE:     open-cluster-management (v1:metadata.namespace)
            DEPLOYMENT_LABEL:  multicluster-integrations-syncresource
            OPERATOR_NAME:     multicluster-integrations
          Mounts:
            /etc/gitops-resources from multicluster-integrations-syncresource (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vp7gz (ro)
        multicluster-integrations-aggregation:
          Container ID:  cri-o://45b7dc5c4abf8ede454d4c27221ef00836c6eade20ef037e465a887f7b5123f4
          Image:         e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-integrations-rhel8@sha256:9227443a4a57c432f48301019af17691c9070778210a3f22425bd4d8f85bcc29
          Image ID:      e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-integrations-rhel8@sha256:59c8d4fc46b89117e69a255ea987380c111d551e3086be179ea80fb783a12101
          Port:          <none>
          Host Port:     <none>
          Command:
            /usr/local/bin/multiclusterstatusaggregation
            --appset-resource-dir=/etc/gitops-resources
            --sync-interval=10
            --leader-election-lease-duration=137
            --renew-deadline=107
            --retry-period=26
          State:          Running
            Started:      Tue, 04 Apr 2023 22:34:20 +0000
          Ready:          True
          Restart Count:  0
          Limits:
            cpu:     100m
            memory:  512Mi
          Requests:
            cpu:      25m
            memory:   64Mi
          Liveness:   exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
          Readiness:  exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
          Environment:
            WATCH_NAMESPACE:
            POD_NAME:          multicluster-integrations-75d6547fdf-q7mcv (v1:metadata.name)
            POD_NAMESPACE:     open-cluster-management (v1:metadata.namespace)
            DEPLOYMENT_LABEL:  multicluster-integrations-aggregation
            OPERATOR_NAME:     multicluster-integrations
          Mounts:
            /etc/gitops-resources from multicluster-integrations-syncresource (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vp7gz (ro)
      Conditions:
        Type              Status
        Initialized       True
        Ready             False
        ContainersReady   False
        PodScheduled      True
      Volumes:
        multicluster-integrations-syncresource:
          Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
          Medium:
          SizeLimit:  <unset>
        kube-api-access-vp7gz:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3607
          ConfigMapName:           kube-root-ca.crt
          ConfigMapOptional:       <nil>
          DownwardAPI:             true
          ConfigMapName:           openshift-service-ca.crt
          ConfigMapOptional:       <nil>
      QoS Class:                   Burstable
      Node-Selectors:              <none>
      Tolerations:                 node-role.kubernetes.io/infra:NoSchedule op=Exists
                                   node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                   node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
      Events:
        Type     Reason   Age                   From     Message
        ----     ------   ----                  ----     -------
        Normal   Pulled   86m (x155 over 16h)   kubelet  Container image "e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-integrations-rhel8@sha256:9227443a4a57c432f48301019af17691c9070778210a3f22425bd4d8f85bcc29" already present on machine
        Warning  BackOff  81s (x4028 over 14h)  kubelet  Back-off restarting failed container
      # oc logs -n open-cluster-management multicluster-integrations-75d6547fdf-q7mcv -c argocd-pull-integration-controller-manager --timestamps -p
      2023-04-05T14:39:39.769643277Z I0405 14:39:39.769471       1 request.go:690] Waited for 1.040518232s due to client-side throttling, not priority and fairness, request: GET:https://[fd02::1]:443/apis/monitoring.coreos.com/v1?timeout=32s
      2023-04-05T14:39:43.444673741Z 1.6807055834445755e+09   INFO    controller-runtime.metrics      Metrics server is starting to listen    {"addr": "0.0.0.0:8386"}
      2023-04-05T14:39:43.467355681Z 1.6807055834672885e+09   INFO    setup   found CRD applications.argoproj.io
      2023-04-05T14:39:43.467438162Z 1.6807055834674232e+09   INFO    setup   starting manager
      2023-04-05T14:39:43.467702957Z 1.6807055834676733e+09   INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8386"}
      2023-04-05T14:39:43.467821086Z 1.680705583467794e+09    INFO    Starting EventSource    {"controller": "application", "controllerGroup": "argoproj.io", "controllerKind": "Application", "source": "kind source: *v1alpha1.Application"}
      2023-04-05T14:39:43.467826679Z 1.6807055834678226e+09   INFO    Starting Controller     {"controller": "application", "controllerGroup": "argoproj.io", "controllerKind": "Application"}
      2023-04-05T14:39:43.467860255Z 1.6807055834678254e+09   INFO    Starting EventSource    {"controller": "manifestwork", "controllerGroup": "work.open-cluster-management.io", "controllerKind": "ManifestWork", "source": "kind source: *v1.ManifestWork"}
      2023-04-05T14:39:43.467865046Z 1.6807055834678595e+09   INFO    Starting Controller     {"controller": "manifestwork", "controllerGroup": "work.open-cluster-management.io", "controllerKind": "ManifestWork"}
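
With three containers in the pod, it can help to pull out only the ones whose last termination was an OOM kill. The sketch below scans `oc describe pod` text for that; `oom_killed_containers` is a hypothetical helper, and it assumes the layout shown in the output above (a `Containers:` section, lowercase container names on their own line ending in a colon, and a `Reason:` line following `Last State:`). It is a text heuristic, not an API query.

```shell
# Hypothetical helper: read `oc describe pod` output on stdin and print the
# name of every container whose "Last State" reason is OOMKilled.
oom_killed_containers() {
  awk '
    # Enter and leave the Containers: section of the describe output.
    NF == 1 && $1 == "Containers:" { in_containers = 1; next }
    NF == 1 && $1 == "Conditions:" { in_containers = 0 }

    # Heuristic: a lone lowercase token ending in ":" is a container name
    # (field headers such as "Command:" or "Limits:" start with a capital).
    in_containers && NF == 1 && /^ *[a-z0-9][-a-z0-9]*:$/ {
      name = $1
      sub(/:$/, "", name)
      next
    }

    # Only the Reason that follows "Last State:" is of interest,
    # not the one that follows the current "State:".
    /^ *Last State:/ { in_last = 1; next }
    /^ *State:/      { in_last = 0; next }
    in_last && /^ *Reason:/ {
      if ($2 == "OOMKilled") print name
      in_last = 0
    }
  '
}

# Usage (assumes a logged-in oc session):
#   oc describe po -n open-cluster-management <pod-name> | oom_killed_containers
```

Run against the describe output in this report, it would be expected to list only argocd-pull-integration-controller-manager, since the syncresource container's last termination reason is Error and the aggregation container has not restarted.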

       

              ming@redhat.com Mike Ng
              akrzos@redhat.com Alex Krzos