Red Hat Advanced Cluster Management / ACM-3468

ocm-webhook OOM at scale of ~2500 managedclusters (ACM 2.6)


      Description of problem:

      While deploying ~2500 SNOs with ACM 2.6.4 for ACM upgrade testing, the ocm-webhook containers are being OOMKilled. Most likely we just need the same fix as https://issues.redhat.com/browse/ACM-2305 backported to 2.6; a possible stopgap is sketched below.
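
      If a stopgap is needed before the backport lands, the limit can be raised by hand with something like the command below. The 512Mi value is an arbitrary guess rather than a tuned number, and the multiclusterengine operator may reconcile the deployment back to its defaults:

      # oc set resources deployment ocm-webhook -n multicluster-engine \
          -c ocm-webhook --limits=memory=512Mi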

      Version-Release number of selected component (if applicable):

      ACM - 2.6.4-DOWNSTREAM-2023-01-31-18-35-03

      OCP Hub 4.12.1, SNOs 4.12.1

      How reproducible:

      Continuous at this scale; the two ocm-webhook pods have restarted 30 and 14 times over ~17h (see output below).

      Steps to Reproduce:

      1. Deploy an ACM 2.6.4 hub and provision ~2500 SNO managed clusters.
      2. Watch the ocm-webhook pods in the multicluster-engine namespace (see the command below).
      3. Observe repeated OOMKilled restarts.
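
      The restarts can be watched with:

      # oc get po -n multicluster-engine -l control-plane=ocm-webhook -w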

      Actual results:

      Both ocm-webhook replicas are repeatedly OOMKilled (exit code 137) against the 256Mi memory limit and restart.

      Expected results:

      ocm-webhook stays within its memory limit at this scale and does not restart.

      Additional info:

      # oc get po -n multicluster-engine -l control-plane=ocm-webhook
      NAME                           READY   STATUS    RESTARTS        AGE
      ocm-webhook-7bf8855796-6dwxb   1/1     Running   30 (91m ago)    17h
      ocm-webhook-7bf8855796-vpzvc   1/1     Running   14 (139m ago)   17h
      # oc get managedcluster --no-headers | grep -v Unknown | grep True -c
      2413
      # oc get deploy -n multicluster-engine ocm-webhook -o json | jq '.spec.template.spec.containers[0].resources'
      {
        "limits": {
          "memory": "256Mi"
        },
        "requests": {
          "cpu": "50m",
          "memory": "128Mi"
        }
      } 
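
      Current usage relative to that 256Mi limit can be checked with the following, assuming cluster metrics are available:

      # oc adm top pod -n multicluster-engine -l control-plane=ocm-webhook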

      A describe of one of the pods shows the OOMKilled state:

      # oc describe po -n multicluster-engine ocm-webhook-7bf8855796-6dwxb
      Name:             ocm-webhook-7bf8855796-6dwxb
      Namespace:        multicluster-engine
      Priority:         0
      Service Account:  ocm-foundation-sa
      Node:             e27-h03-000-r650/fc00:1002::6
      Start Time:       Mon, 13 Feb 2023 21:34:43 +0000
      Labels:           control-plane=ocm-webhook
                        ocm-antiaffinity-selector=ocm-webhook
                        pod-template-hash=7bf8855796
      Annotations:      k8s.ovn.org/pod-networks:
                          {"default":{"ip_addresses":["fd01:0:0:3::34/64"],"mac_address":"0a:58:f9:03:bf:bf","gateway_ips":["fd01:0:0:3::1"],"ip_address":"fd01:0:0:...
                        k8s.v1.cni.cncf.io/network-status:
                          [{
                              "name": "ovn-kubernetes",
                              "interface": "eth0",
                              "ips": [
                                  "fd01:0:0:3::34"
                              ],
                              "mac": "0a:58:f9:03:bf:bf",
                              "default": true,
                              "dns": {}
                          }]
                        k8s.v1.cni.cncf.io/networks-status:
                          [{
                              "name": "ovn-kubernetes",
                              "interface": "eth0",
                              "ips": [
                                  "fd01:0:0:3::34"
                              ],
                              "mac": "0a:58:f9:03:bf:bf",
                              "default": true,
                              "dns": {}
                          }]
                        openshift.io/scc: restricted-v2
                        seccomp.security.alpha.kubernetes.io/pod: runtime/default
      Status:           Running
      IP:               fd01:0:0:3::34
      IPs:
        IP:           fd01:0:0:3::34
      Controlled By:  ReplicaSet/ocm-webhook-7bf8855796
      Containers:
        ocm-webhook:
          Container ID:  cri-o://c59972bfa1bfc5da65e62bd7131bef045c0e0fcca5507b4734e1a960487f0d1a
          Image:         e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-manager-rhel8@sha256:59d254159c3292763594d496f36c96d995b8a86367ceda1637db49b54fc95a4d
          Image ID:      e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-manager-rhel8@sha256:59d254159c3292763594d496f36c96d995b8a86367ceda1637db49b54fc95a4d
          Port:          8000/TCP
          Host Port:     0/TCP
          Args:
            /webhook
            --tls-cert-file=/var/run/ocm-webhook/tls.crt
            --tls-private-key-file=/var/run/ocm-webhook/tls.key
          State:          Running
            Started:      Tue, 14 Feb 2023 13:18:23 +0000
          Last State:     Terminated
            Reason:       OOMKilled
            Exit Code:    137
            Started:      Tue, 14 Feb 2023 13:12:16 +0000
            Finished:     Tue, 14 Feb 2023 13:13:14 +0000
          Ready:          True
          Restart Count:  30
          Limits:
            memory:  256Mi
          Requests:
            cpu:        50m
            memory:     128Mi
          Liveness:     exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
          Readiness:    exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
          Environment:  <none>
          Mounts:
            /var/run/ocm-webhook from webhook-cert (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-npwdj (ro)
      Conditions:
        Type              Status
        Initialized       True
        Ready             True
        ContainersReady   True
        PodScheduled      True
      Volumes:
        webhook-cert:
          Type:        Secret (a volume populated by a Secret)
          SecretName:  ocm-webhook
          Optional:    false
        kube-api-access-npwdj:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3607
          ConfigMapName:           kube-root-ca.crt
          ConfigMapOptional:       <nil>
          DownwardAPI:             true
          ConfigMapName:           openshift-service-ca.crt
          ConfigMapOptional:       <nil>
      QoS Class:                   Burstable
      Node-Selectors:              <none>
      Tolerations:                 node-role.kubernetes.io/infra:NoSchedule op=Exists
                                   node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                   node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
      Events:
        Type     Reason     Age                  From     Message
        ----     ------     ----                 ----     -------
        Normal   Pulled     142m (x19 over 12h)  kubelet  Container image "e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-manager-rhel8@sha256:59d254159c3292763594d496f36c96d995b8a86367ceda1637db49b54fc95a4d" already present on machine
        Normal   Created    142m (x20 over 17h)  kubelet  Created container ocm-webhook
        Normal   Started    142m (x20 over 17h)  kubelet  Started container ocm-webhook
        Warning  Unhealthy  138m (x18 over 12h)  kubelet  Readiness probe failed:
        Warning  Unhealthy  118m (x23 over 12h)  kubelet  Liveness probe failed:
        Warning  BackOff    93m (x291 over 12h)  kubelet  Back-off restarting failed container
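
      The same OOMKilled last state can be confirmed on both replicas without a full describe; this one-liner assumes a single container per pod:

      # oc get po -n multicluster-engine -l control-plane=ocm-webhook -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'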
      

      It is unclear whether these OOMs are causing any fatal issues with the environment at the current time; however, in the ACM 2.7 testing they were a ceiling on provisioning more clusters. (For the purposes of this testing we have not breached that ceiling yet.)

        Assignee: DangPeng Liu (daliu@redhat.com)
        Reporter: Alex Krzos (akrzos@redhat.com)