Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: ACM 2.9.0
Affects Version/s: ACM 2.9.0
Component/s: Application Lifecycle
Labels:
- perfscale-telco-5g
- telco-5g

Blocked: False
Blocked Reason: None
Ready: False
Intelligence Requested:
Market:
Severity: Important
Regression: No
SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:
Description of problem:

The multicluster-operators-gitopscluster container in the multicluster-operators-application pod is OOM crash-looping on a large-scale hub that has deployed and manages 3500+ SNOs.

# oc get po -n open-cluster-management multicluster-operators-application-fbc4696f6-7c5fj
NAME                                                 READY   STATUS    RESTARTS          AGE
multicluster-operators-application-fbc4696f6-7c5fj   3/3     Running   883 (4m48s ago)   4d22h
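As a rough sanity check on the restart rate, the figures in the output above (883 restarts over an age of 4d22h) can be turned into an average interval between restarts. This is only a back-of-the-envelope sketch: the variable names are illustrative, and it uses integer arithmetic and ignores seconds.

```shell
# Figures copied from the `oc get po` output above.
restarts=883
# Pod age of 4d22h expressed in minutes.
age_minutes=$(( (4 * 24 + 22) * 60 ))
# Average minutes between restarts (integer division).
per_restart=$(( age_minutes / restarts ))
echo "average of about ${per_restart} minutes between restarts"
```

The pod-level RESTARTS value is the sum over all containers in the pod, so nearly all of these restarts belong to the gitopscluster container (882 in the pod description below), with one from the placementrule container.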

Count of managed clusters (not all are available, since some deployed clusters failed to install):

# oc get managedcluster -A --no-headers | wc -l
3619

Version-Release number of selected component (if applicable):

ACM - 2.9.0-DOWNSTREAM-2023-09-01-02-58-15

Hub cluster is OCP 4.13.10

Deployed SNOs were originally at 4.12.29 and were later upgraded to 4.13.9 (the OOM crash-looping started before the upgrade)

How reproducible:

Steps to Reproduce:

...

Actual results:

Expected results:

Additional info:

Pod description:

# oc describe po -n open-cluster-management multicluster-operators-application-fbc4696f6-7c5fj
Name:             multicluster-operators-application-fbc4696f6-7c5fj
Namespace:        open-cluster-management
Priority:         0
Service Account:  multicluster-applications
Node:             e27-h05-000-r650/fc00:1004::7
Start Time:       Fri, 01 Sep 2023 21:03:24 +0000
Labels:           app=multicluster-operators-application
                  ocm-antiaffinity-selector=multicluster-operators-application
                  pod-template-hash=fbc4696f6
Annotations:      k8s.ovn.org/pod-networks:
                    {"default":{"ip_addresses":["fd01:0:0:1::42/64"],"mac_address":"0a:58:13:d6:9f:ba","gateway_ips":["fd01:0:0:1::1"],"ip_address":"fd01:0:0:...
                  k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "ovn-kubernetes",
                        "interface": "eth0",
                        "ips": [
                            "fd01:0:0:1::42"
                        ],
                        "mac": "0a:58:13:d6:9f:ba",
                        "default": true,
                        "dns": {}
                    }]
                  openshift.io/scc: restricted-v2
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:           Running
IP:               fd01:0:0:1::42
IPs:
  IP:           fd01:0:0:1::42
Controlled By:  ReplicaSet/multicluster-operators-application-fbc4696f6
Containers:
  multicluster-operators-placementrule:
    Container ID:  cri-o://776f4936cff86a713c889ca10d1a66b4815cdb25303cbe644564a5479cf7bb83
    Image:         e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-subscription-rhel8@sha256:a92716fa2c4798ceb9a1cc35a84f57874a703c2ba36b8b630603f4c987078a58
    Image ID:      e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-subscription-rhel8@sha256:2b39642174254dc51a044f55c9ca234003c62bb855f5b72eca6c11a5cd845e7d
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/local/bin/multicluster-operators-placementrule
      --alsologtostderr
      --leader-election-lease-duration=137s
      --leader-election-renew-deadline=107s
      --leader-election-retry-period=26s
    State:          Running
      Started:      Fri, 01 Sep 2023 21:04:48 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 01 Sep 2023 21:03:35 +0000
      Finished:     Fri, 01 Sep 2023 21:04:45 +0000
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     1500m
      memory:  1536Mi
    Requests:
      cpu:      300m
      memory:   64Mi
    Liveness:   exec [ls] delay=30s timeout=1s period=30s #success=1 #failure=3
    Readiness:  exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
    Environment:
      WATCH_NAMESPACE:
      POD_NAME:          multicluster-operators-application-fbc4696f6-7c5fj (v1:metadata.name)
      POD_NAMESPACE:     open-cluster-management (v1:metadata.namespace)
      DEPLOYMENT_LABEL:  multicluster-operators-placementrule
      OPERATOR_NAME:     multicluster-operators-application
    Mounts:
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7bxtz (ro)
  multicluster-operators-gitopscluster:
    Container ID:  cri-o://7b44880d1ac37e1c33bf7df783b010271e6f1f4a1622bda4de4bde407b393250
    Image:         e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-integrations-rhel8@sha256:80476ac2be7f80532025ac426dcde05e196b8ad6d57737e7a2febf4ab16214e4
    Image ID:      e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicloud-integrations-rhel8@sha256:20926f7ace2b880a0fc90251aa98574147737262a3ca6a1022533b0580a9314f
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/local/bin/gitopscluster
      --alsologtostderr
      --leader-election-lease-duration=137s
      --leader-election-renew-deadline=107s
      --leader-election-retry-period=26s
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 06 Sep 2023 19:52:45 +0000
      Finished:     Wed, 06 Sep 2023 19:57:29 +0000
    Ready:          False
    Restart Count:  882
    Limits:
      cpu:     100m
      memory:  1Gi
    Requests:
      cpu:      25m
      memory:   64Mi
    Liveness:   exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
    Readiness:  exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
    Environment:
      WATCH_NAMESPACE:
      POD_NAME:          multicluster-operators-application-fbc4696f6-7c5fj (v1:metadata.name)
      POD_NAMESPACE:     open-cluster-management (v1:metadata.namespace)
      DEPLOYMENT_LABEL:  multicluster-operators-gitopscluster
      OPERATOR_NAME:     multicluster-operators-application
    Mounts:
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7bxtz (ro)
  multicluster-operators-application:
    Container ID:  cri-o://b7c2c2034019dcab334f7a8886180e2f60a1fc73eedab7b18167bae0d44efb35
    Image:         e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-application-rhel8@sha256:ca0947ce525caa24d34e7a17dc16dedecb1c07b3ad300bea7377bda62cbda4a5
    Image ID:      e27-h01-000-r650.rdu2.scalelab.redhat.com:5000/acm-d/multicluster-operators-application-rhel8@sha256:8780b674a6ea8081e5213d1257543b65bb3b177d89ac166148db6434d575f419
    Port:          9442/TCP
    Host Port:     0/TCP
    Command:
      /usr/local/bin/multicluster-operators-application
      --alsologtostderr
      --leader-election-lease-duration=137s
      --leader-election-renew-deadline=107s
      --leader-election-retry-period=26s
    State:          Running
      Started:      Fri, 01 Sep 2023 21:03:36 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  512Mi
    Requests:
      cpu:      25m
      memory:   64Mi
    Liveness:   exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
    Readiness:  exec [ls] delay=15s timeout=1s period=15s #success=1 #failure=3
    Environment:
      WATCH_NAMESPACE:
      POD_NAME:          multicluster-operators-application-fbc4696f6-7c5fj (v1:metadata.name)
      POD_NAMESPACE:     open-cluster-management (v1:metadata.namespace)
      DEPLOYMENT_LABEL:  multicluster-operators-application
      OPERATOR_NAME:     multicluster-operators-application
    Mounts:
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7bxtz (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-7bxtz:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/infra:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                      From     Message
  ----     ------     ----                     ----     -------
  Warning  Unhealthy  53m                      kubelet  Liveness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of ec3c13e66a67c4d13c99b1ee843b3e29904a0389292afbb5842f31e214c74876 is running failed: container process not found
  Warning  Unhealthy  53m                      kubelet  Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of ec3c13e66a67c4d13c99b1ee843b3e29904a0389292afbb5842f31e214c74876 is running failed: container process not found
  Warning  BackOff    36s (x13165 over 4d16h)  kubelet  Back-off restarting failed container multicluster-operators-gitopscluster in pod multicluster-operators-application-fbc4696f6-7c5fj_open-cluster-management(a4d99811-c098-4a94-9731-d695dabae4e9)
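To pick out the failing container from a long pod description like the one above, a small filter over the `oc describe po` text can print each container's name, restart count, and last termination reason. The sketch below is a hypothetical helper (the name `summarize_containers` is not part of any tool) and assumes the indentation layout shown above; it only looks at the `Containers:` section and does not handle init containers.

```shell
# Hypothetical helper: reads `oc describe po` output on stdin and prints
# "<container> <restart count> <last termination reason>" per container.
summarize_containers() {
  awk '
    # A non-indented line starts a new top-level section.
    /^[^ ]/               { in_containers = ($1 == "Containers:") }
    !in_containers        { next }
    # Two-space indent ending in a colon: a container name.
    /^  [^ ].*:$/         { name = $1; sub(/:$/, "", name); reason = "-"; in_last = 0; next }
    # Only a Reason that follows "Last State:" is recorded.
    /^    State:/         { in_last = 0 }
    /^    Last State:/    { in_last = 1 }
    /^      Reason:/      { if (in_last) reason = $2 }
    /^    Restart Count:/ { print name, $3, reason }
  '
}
```

It would be used as `oc describe po -n open-cluster-management multicluster-operators-application-fbc4696f6-7c5fj | summarize_containers`. A container that has never terminated is reported with `-` as its reason.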


multicluster-operators-gitopscluster.log
2023/09/06 8:03 PM
35 kB
Alex Krzos
