ACM-26374

console-mce-console readiness and liveness probe failures at scale during ACM ZTP with AAP and EDA



      Description of problem:

      While scale testing ACM by ZTP (Zero Touch Provisioning) of 3500+ SNOs, with AAP configured through EDA to monitor a Kafka event bus (provided by Multicluster Global Hub) and initiate a playbook for every successful CGU, the console-mce-console pods were crashlooping due to probe failures.
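
      For reference, the EDA side of this setup looks roughly like the rulebook below. This is a minimal sketch: the Kafka bootstrap host, topic name, event shape, and job template name are placeholders rather than the exact values used in this test.

      - name: React to CGU completion events from the global hub Kafka bus
        hosts: all
        sources:
          - ansible.eda.kafka:
              host: kafka-kafka-bootstrap.multicluster-global-hub.svc  # placeholder bootstrap service
              port: 9092
              topic: gh-event                                          # placeholder topic name
        rules:
          - name: Launch a playbook for every successful CGU
            condition: event.body.reason == "CguSuccess"               # hypothetical event field
            action:
              run_job_template:
                name: post-cgu-playbook                                # placeholder job template
                organization: Default

      With this pipeline running against 3500+ SNOs, the console-mce-console pods accumulated restarts: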

       

      NAMESPACE                                          NAME                                                              READY   STATUS             RESTARTS        AGE     IP                NODE               NOMINATED NODE   READINESS GATES
      ...
      multicluster-engine                                console-mce-console-5579749956-259sh                              1/1     Running            7 (26m ago)     4h59m   fd01:0:0:3::31    d16-h14-000-r650   <none>           <none>
      multicluster-engine                                console-mce-console-5579749956-kq4wz                              1/1     Running            12 (25m ago)    4h59m   fd01:0:0:1::72    d16-h10-000-r650   <none>           <none>
      ...

      In the oc describe output for one of these pods, we can see that the probe failures are the cause of the crashlooping behavior:

       

      ...
        console:
          Container ID:   cri-o://2c9ff2a52d6c6d8cc13e08b8942be76441093982e69a9d430d653a8a488ed68c
          Image:          registry.redhat.io/multicluster-engine/console-mce-rhel9@sha256:519c4d77a3a0c1bc85a7d7ef5218544e447ca8fd3708e03faca8eb183667f534
          Image ID:       registry.redhat.io/multicluster-engine/console-mce-rhel9@sha256:519c4d77a3a0c1bc85a7d7ef5218544e447ca8fd3708e03faca8eb183667f534
          Port:           3000/TCP
          Host Port:      0/TCP
          State:          Running
            Started:      Sat, 15 Nov 2025 00:50:02 +0000
          Last State:     Terminated
            Reason:       Error
            Exit Code:    137
            Started:      Sat, 15 Nov 2025 00:47:32 +0000
            Finished:     Sat, 15 Nov 2025 00:50:02 +0000
          Ready:          True
          Restart Count:  7
          Requests:
            cpu:      3m
            memory:   40Mi
          Liveness:   http-get https://:3000/livenessProbe delay=10s timeout=10s period=10s #success=1 #failure=3
          Readiness:  http-get https://:3000/readinessProbe delay=0s timeout=10s period=10s #success=1 #failure=3
          Environment:
            PORT:             3000
            CLUSTER_API_URL:  https://kubernetes.default.svc:443
          Mounts:
            /app/certs from console-mce-console-certs (rw)
            /app/config from console-mce-console-mce-config (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wk6q5 (ro)
      Conditions:
        Type                        Status
        PodReadyToStartContainers   True 
        Initialized                 True 
        Ready                       True 
        ContainersReady             True 
        PodScheduled                True 
      Volumes:
        console-mce-console-certs:
          Type:        Secret (a volume populated by a Secret)
          SecretName:  console-mce-console-certs
          Optional:    false
        console-mce-console-mce-config:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      console-mce-config
          Optional:  false
        kube-api-access-wk6q5:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3607
          ConfigMapName:           kube-root-ca.crt
          Optional:                false
          DownwardAPI:             true
          ConfigMapName:           openshift-service-ca.crt
          Optional:                false
      QoS Class:                   Burstable
      Node-Selectors:              <none>
      Tolerations:                 node-role.kubernetes.io/infra:NoSchedule op=Exists
                                   node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                   node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
      Events:
        Type     Reason     Age                  From     Message
        ----     ------     ----                 ----     -------
        Warning  Unhealthy  66m                  kubelet  Liveness probe failed: Get "https://[fd01:0:0:3::31]:3000/livenessProbe": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
        Warning  Unhealthy  64m (x2 over 80m)    kubelet  Readiness probe failed: Get "https://[fd01:0:0:3::31]:3000/readinessProbe": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
        Warning  Unhealthy  63m                  kubelet  Readiness probe failed: Get "https://[fd01:0:0:3::31]:3000/readinessProbe": read tcp [fd01:0:0:3::2]:51308->[fd01:0:0:3::31]:3000: read: connection reset by peer
        Warning  Unhealthy  61m                  kubelet  Readiness probe failed: Get "https://[fd01:0:0:3::31]:3000/readinessProbe": dial tcp [fd01:0:0:3::31]:3000: connect: connection refused
        Warning  Unhealthy  45m (x19 over 77m)   kubelet  Readiness probe failed: Get "https://[fd01:0:0:3::31]:3000/readinessProbe": context deadline exceeded
        Normal   Killing    26m (x7 over 64m)    kubelet  Container console failed liveness probe, will be restarted
        Normal   Pulled     26m (x7 over 63m)    kubelet  Container image "registry.redhat.io/multicluster-engine/console-mce-rhel9@sha256:519c4d77a3a0c1bc85a7d7ef5218544e447ca8fd3708e03faca8eb183667f534" already present on machine
        Normal   Created    26m (x7 over 63m)    kubelet  Created container: console
        Normal   Started    26m (x7 over 63m)    kubelet  Started container console
        Warning  Unhealthy  15m (x32 over 64m)   kubelet  Liveness probe failed: Get "https://[fd01:0:0:3::31]:3000/livenessProbe": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
        Warning  Unhealthy  5m8s (x42 over 66m)  kubelet  Readiness probe failed: Get "https://[fd01:0:0:3::31]:3000/readinessProbe": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
        Warning  Unhealthy  9s (x38 over 77m)    kubelet  Liveness probe failed: Get "https://[fd01:0:0:3::31]:3000/livenessProbe": context deadline exceeded
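
      A possible short-term mitigation, while the root cause is investigated, is to relax the probe timeout and failure threshold on the console-mce-console deployment along the lines of the strategic-merge patch below. The values are illustrative only, and since this deployment is managed by the MCE operator, a manual patch may be reconciled away; a real fix would likely need to change the operator defaults.

      # probe-patch.yaml - illustrative values, not a tested fix
      spec:
        template:
          spec:
            containers:
              - name: console                # container name from the oc describe output above
                livenessProbe:
                  timeoutSeconds: 30         # up from 10s
                  failureThreshold: 6        # up from 3
                readinessProbe:
                  timeoutSeconds: 30
                  failureThreshold: 6

      This could be applied with "oc -n multicluster-engine patch deployment console-mce-console --patch-file probe-patch.yaml".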

      Version-Release number of selected component (if applicable):

      OCP (hub) - 4.20.2

      Deployed OCP (SNOs) - 4.20.2

      ACM - 2.15.0-DOWNSTREAM-2025-10-29-01-15-32

      AAP - aap-operator.v2.6.0-0.1762261209

      How reproducible:

      This occurred in all scale tests with 3500+ managed clusters where AAP and MCGH were installed and configured.

      Steps to Reproduce:

      1. Install ACM, Multicluster Global Hub, and AAP on a hub cluster, with EDA configured to monitor the MCGH Kafka event bus and launch a playbook for every successful CGU.
      2. ZTP 3500+ SNOs.
      3. Watch the console-mce-console pods in the multicluster-engine namespace.

      Actual results:

      The console-mce-console pods repeatedly fail their liveness and readiness probes and are restarted by the kubelet, crashlooping under load.

      Expected results:

      The console-mce-console pods stay Ready and respond to probes within the configured timeouts, even with 3500+ managed clusters, AAP, and MCGH on the hub.

      Additional info:
