Type: Bug
Resolution: Done
Priority: Undefined
Affects Version: ACM 2.6.4
Description of problem:
While deploying ~2500 SNOs with ACM 2.6.4 for ACM upgrade testing, the managedcluster-import-controller-v2 pods are repeatedly OOMKilled once the hub reaches a scale of ~2400 managed clusters, preventing further clusters from being imported. Likely we just need the same solution as https://issues.redhat.com/browse/ACM-2275 backported to 2.6.
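For anyone triaging a similar stall, a quick way to confirm the restarts are OOM-driven is to read each controller container's last terminated state; a minimal sketch, assuming the namespace and label shown in the outputs below:

# oc get po -n multicluster-engine -l app=managedcluster-import-controller-v2 \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'

Each pod should print OOMKilled as its last termination reason if it is the same failure mode.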
Version-Release number of selected component (if applicable):
ACM - 2.6.4-DOWNSTREAM-2023-01-31-18-35-03
OCP Hub 4.12.1, SNOs 4.12.1
How reproducible:
Consistent once the hub approaches ~2400 managed clusters.
Steps to Reproduce:
- Deploy an ACM 2.6.4 hub on OCP 4.12.1.
- Provision and import ~2500 SNO (4.12.1) managed clusters.
- Watch the managedcluster-import-controller-v2 pods as the managed cluster count approaches ~2400.
Actual results:
The managedcluster-import-controller-v2 pods are repeatedly OOMKilled against their 2Gi memory limit (~80 restarts in 16h) and imports stall at ~2413 clusters.
Expected results:
All ~2500 provisioned clusters are imported; the import controller stays within its memory limit.
Additional info:
# oc get po -n multicluster-engine -l app=managedcluster-import-controller-v2
NAME READY STATUS RESTARTS AGE
managedcluster-import-controller-v2-6d4bdb4d8-4xm4r 1/1 Running 79 (4m42s ago) 16h
managedcluster-import-controller-v2-6d4bdb4d8-pw6pw 1/1 Running 80 (4m4s ago) 16h
# oc get managedcluster --no-headers | grep -v Unknown | grep True -c
2413
# oc get deploy -n multicluster-engine managedcluster-import-controller-v2 -o json | jq '.spec.template.spec.containers[0].resources'
{
"limits": {
"cpu": "500m",
"memory": "2Gi"
},
"requests": {
"cpu": "50m",
"memory": "96Mi"
}
}
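As a stopgap while a backport lands, the memory limit can be raised in place; a hedged sketch (the 4Gi value is an assumption rather than a tuned number, and the multicluster-engine operator may reconcile the deployment back to its defaults, so the bump may not stick):

# oc patch deploy -n multicluster-engine managedcluster-import-controller-v2 --type=json \
    -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "4Gi"}]'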
A describe of one of the pods, showing the OOMKilled last state:
# oc describe po -n multicluster-engine managedcluster-import-controller-v2-6d4bdb4d8-4xm4r
Name: managedcluster-import-controller-v2-6d4bdb4d8-4xm4r
Namespace: multicluster-engine
Priority: 0
Service Account: managedcluster-import-controller-v2
Node: e27-h05-000-r650/fc00:1002::7
Start Time: Mon, 13 Feb 2023 21:34:43 +0000
Labels: app=managedcluster-import-controller-v2
ocm-antiaffinity-selector=managedclusterimport
pod-template-hash=6d4bdb4d8
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["fd01:0:0:1::3c/64"],"mac_address":"0a:58:df:cf:6d:db","gateway_ips":["fd01:0:0:1::1"],"ip_address":"fd01:0:0:...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"fd01:0:0:1::3c"
],
"mac": "0a:58:df:cf:6d:db",
"default": true,
"dns": {}
}]
k8s.v1.cni.cncf.io/networks-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"fd01:0:0:1::3c"
],
"mac": "0a:58:df:cf:6d:db",
"default": true,
"dns": {}
}]
openshift.io/scc: restricted-v2
scheduler.alpha.kubernetes.io/critical-pod:
seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Running
IP: fd01:0:0:1::3c
IPs:
IP: fd01:0:0:1::3c
Controlled By: ReplicaSet/managedcluster-import-controller-v2-6d4bdb4d8
Containers:
managedcluster-import-controller:
Container ID: cri-o://816e98a3aa209f91b5394e7ba8099deca3cee373650ace0a0f69d9d3eb4d266e
Image: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/managedcluster-import-controller-rhel8@sha256:4868d67485b6392985a495ae3fc177e4b090fb252fdd6dffc40e353aa0db126d
Image ID: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/managedcluster-import-controller-rhel8@sha256:4868d67485b6392985a495ae3fc177e4b090fb252fdd6dffc40e353aa0db126d
Port: <none>
Host Port: <none>
State: Running
Started: Tue, 14 Feb 2023 14:51:35 +0000
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Tue, 14 Feb 2023 14:42:27 +0000
Finished: Tue, 14 Feb 2023 14:51:08 +0000
Ready: True
Restart Count: 82
Limits:
cpu: 500m
memory: 2Gi
Requests:
cpu: 50m
memory: 96Mi
Environment:
WATCH_NAMESPACE:
POD_NAME: managedcluster-import-controller-v2-6d4bdb4d8-4xm4r (v1:metadata.name)
MAX_CONCURRENT_RECONCILES: 10
OPERATOR_NAME: managedcluster-import-controller
DEFAULT_IMAGE_PULL_SECRET: multiclusterhub-operator-pull-secret
DEFAULT_IMAGE_REGISTRY:
REGISTRATION_OPERATOR_IMAGE: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/registration-operator-rhel8@sha256:85dc5defbf986e36842dfa8cbf3ff764c1a4636779eb6d0bb553bc52263f6867
REGISTRATION_IMAGE: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/registration-rhel8@sha256:112c8f9f6c237dace9f2137525256ac034b8fd7a853ccd00de3963c84799fa3b
WORK_IMAGE: e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/work-rhel8@sha256:2a1efeda7ce5b3a974078b31b603f157b087d7fc8408e40e4c6e2c03efcf4530
POD_NAMESPACE: multicluster-engine (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9rkdh (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-9rkdh:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/infra:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Started 42m (x75 over 12h) kubelet Started container managedcluster-import-controller
Normal Pulled 15m (x80 over 12h) kubelet Container image "e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/managedcluster-import-controller-rhel8@sha256:4868d67485b6392985a495ae3fc177e4b090fb252fdd6dffc40e353aa0db126d" already present on machine
Normal Created 15m (x80 over 12h) kubelet Created container managedcluster-import-controller
Warning BackOff 3m40s (x99 over 11h) kubelet Back-off restarting failed container
The attached test-result graph shows the managed cluster count hitting a ceiling before all provisioned clusters were managed (the green managed line never reaches the red sno_install_completed line).
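When reproducing, the controller's memory climb and the logs of the OOMKilled instance can be watched with standard tooling; a sketch, assuming cluster metrics are available for oc adm top and substituting a current pod name:

# oc adm top pod -n multicluster-engine -l app=managedcluster-import-controller-v2
# oc logs -n multicluster-engine managedcluster-import-controller-v2-6d4bdb4d8-4xm4r --previous

The --previous flag returns logs from the prior (OOMKilled) container instance rather than the freshly restarted one.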