Issue Type: Bug
Resolution: Done
Affects Version: ACM 2.6.4
Description of problem:
While deploying ~2500 SNOs with ACM 2.6.4 for ACM upgrade testing, the ocm-webhook containers are repeatedly being OOM-killed. Most likely the same fix as in https://issues.redhat.com/browse/ACM-2305 just needs to be backported to 2.6.
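For reference, assuming the ACM-2305 fix amounts to raising the container memory limit (the exact value and mechanism in that change may differ), a stop-gap on a live hub would look roughly like the patch below. The 512Mi value is an assumption for illustration only, and the multicluster-engine operator will likely reconcile a manual edit back, so this only buys time until the backport lands:

# oc patch deploy ocm-webhook -n multicluster-engine --type=json -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "512Mi"}]'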
Version-Release number of selected component (if applicable):
ACM - 2.6.4-DOWNSTREAM-2023-01-31-18-35-03
OCP Hub 4.12.1, SNOs 4.12.1
How reproducible:
Steps to Reproduce:
- ...
Actual results:
The ocm-webhook pods are repeatedly OOMKilled and restarted (30 and 14 restarts in ~17h, see below).
Expected results:
The ocm-webhook pods run within their memory limits without being OOMKilled.
Additional info:
# oc get po -n multicluster-engine -l control-plane=ocm-webhook
NAME                           READY   STATUS    RESTARTS        AGE
ocm-webhook-7bf8855796-6dwxb   1/1     Running   30 (91m ago)    17h
ocm-webhook-7bf8855796-vpzvc   1/1     Running   14 (139m ago)   17h

# oc get managedcluster --no-headers | grep -v Unknown | grep True -c
2413

# oc get deploy -n multicluster-engine ocm-webhook -o json | jq '.spec.template.spec.containers[0].resources'
{
  "limits": {
    "memory": "256Mi"
  },
  "requests": {
    "cpu": "50m",
    "memory": "128Mi"
  }
}
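To spot-check the kill reason across both replicas without a full describe, a jsonpath query along these lines works (illustrative; any equivalent query is fine):

# oc get po -n multicluster-engine -l control-plane=ocm-webhook -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'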
A describe on the pods shows the OOMKilled termination:
# oc describe po -n multicluster-engine ocm-webhook-7bf8855796-6dwxb
Name:             ocm-webhook-7bf8855796-6dwxb
Namespace:        multicluster-engine
Priority:         0
Service Account:  ocm-foundation-sa
Node:             e27-h03-000-r650/fc00:1002::6
Start Time:       Mon, 13 Feb 2023 21:34:43 +0000
Labels:           control-plane=ocm-webhook
                  ocm-antiaffinity-selector=ocm-webhook
                  pod-template-hash=7bf8855796
Annotations:      k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["fd01:0:0:3::34/64"],"mac_address":"0a:58:f9:03:bf:bf","gateway_ips":["fd01:0:0:3::1"],"ip_address":"fd01:0:0:...
                  k8s.v1.cni.cncf.io/network-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "fd01:0:0:3::34" ], "mac": "0a:58:f9:03:bf:bf", "default": true, "dns": {} }]
                  k8s.v1.cni.cncf.io/networks-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "fd01:0:0:3::34" ], "mac": "0a:58:f9:03:bf:bf", "default": true, "dns": {} }]
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Running
IP:               fd01:0:0:3::34
IPs:
  IP:  fd01:0:0:3::34
Controlled By:  ReplicaSet/ocm-webhook-7bf8855796
Containers:
  ocm-webhook:
    Container ID:  cri-o://c59972bfa1bfc5da65e62bd7131bef045c0e0fcca5507b4734e1a960487f0d1a
    Image:         e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-manager-rhel8@sha256:59d254159c3292763594d496f36c96d995b8a86367ceda1637db49b54fc95a4d
    Image ID:      e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-manager-rhel8@sha256:59d254159c3292763594d496f36c96d995b8a86367ceda1637db49b54fc95a4d
    Port:          8000/TCP
    Host Port:     0/TCP
    Args:
      /webhook
      --tls-cert-file=/var/run/ocm-webhook/tls.crt
      --tls-private-key-file=/var/run/ocm-webhook/tls.key
    State:          Running
      Started:      Tue, 14 Feb 2023 13:18:23 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 14 Feb 2023 13:12:16 +0000
      Finished:     Tue, 14 Feb 2023 13:13:14 +0000
    Ready:          True
    Restart Count:  30
    Limits:
      memory:  256Mi
    Requests:
      cpu:     50m
      memory:  128Mi
    Liveness:   exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
    Readiness:  exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/ocm-webhook from webhook-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-npwdj (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  webhook-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ocm-webhook
    Optional:    false
  kube-api-access-npwdj:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/infra:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                  From     Message
  ----     ------     ----                 ----     -------
  Normal   Pulled     142m (x19 over 12h)  kubelet  Container image "e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-manager-rhel8@sha256:59d254159c3292763594d496f36c96d995b8a86367ceda1637db49b54fc95a4d" already present on machine
  Normal   Created    142m (x20 over 17h)  kubelet  Created container ocm-webhook
  Normal   Started    142m (x20 over 17h)  kubelet  Started container ocm-webhook
  Warning  Unhealthy  138m (x18 over 12h)  kubelet  Readiness probe failed:
  Warning  Unhealthy  118m (x23 over 12h)  kubelet  Liveness probe failed:
  Warning  BackOff    93m (x291 over 12h)  kubelet  Back-off restarting failed container
It is unclear whether these OOMs are causing any fatal issues in the environment at the current time; however, in the ACM 2.7 testing they were a ceiling for provisioning more clusters. (For the purposes of this testing we have not breached that ceiling yet.)
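Assuming pod metrics are available on the hub, watching how close the replicas sit to the 256Mi limit while provisioning continues should show when that ceiling is being approached (illustrative):

# oc adm top pod -n multicluster-engine -l control-plane=ocm-webhook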