Type: Bug
Resolution: Unresolved
Priority: Major
Affects Version: ACM 2.15.0
Sprint: ACM Console Train 34 - 2
Severity: Moderate
Description of problem:
While scale testing ACM by using ZTP (Zero Touch Provisioning) to deploy 3500+ SNOs, with AAP configured through EDA to monitor a Kafka event bus (provided by Multicluster Global Hub) and launch a playbook for every successful CGU, the mce console pods were crashlooping due to probe failures.
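The listing below was taken from the hub while the pods were restarting; a command along these lines reproduces it (the exact flags used during the test were not recorded, so treat them as an assumption):
# List the MCE console pods with node/IP details, keeping the header row for context.
oc get pods -A -o wide | grep -E 'NAMESPACE|console-mce-console'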
NAMESPACE             NAME                                    READY   STATUS    RESTARTS       AGE     IP               NODE               NOMINATED NODE   READINESS GATES
...
multicluster-engine   console-mce-console-5579749956-259sh    1/1     Running   7 (26m ago)    4h59m   fd01:0:0:3::31   d16-h14-000-r650   <none>           <none>
multicluster-engine   console-mce-console-5579749956-kq4wz    1/1     Running   12 (25m ago)   4h59m   fd01:0:0:1::72   d16-h10-000-r650   <none>           <none>
...
In the oc describe output we can see that the failing probes are the cause of the crashlooping behavior:
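The output was gathered from one of the restarting replicas with a command of this shape (pod name taken from the listing above):
# Inspect probe configuration and recent kubelet events for one restarting console pod.
oc describe pod console-mce-console-5579749956-259sh -n multicluster-engine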
...
console:
Container ID: cri-o://2c9ff2a52d6c6d8cc13e08b8942be76441093982e69a9d430d653a8a488ed68c
Image: registry.redhat.io/multicluster-engine/console-mce-rhel9@sha256:519c4d77a3a0c1bc85a7d7ef5218544e447ca8fd3708e03faca8eb183667f534
Image ID: registry.redhat.io/multicluster-engine/console-mce-rhel9@sha256:519c4d77a3a0c1bc85a7d7ef5218544e447ca8fd3708e03faca8eb183667f534
Port: 3000/TCP
Host Port: 0/TCP
State: Running
Started: Sat, 15 Nov 2025 00:50:02 +0000
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Sat, 15 Nov 2025 00:47:32 +0000
Finished: Sat, 15 Nov 2025 00:50:02 +0000
Ready: True
Restart Count: 7
Requests:
cpu: 3m
memory: 40Mi
Liveness: http-get https://:3000/livenessProbe delay=10s timeout=10s period=10s #success=1 #failure=3
Readiness: http-get https://:3000/readinessProbe delay=0s timeout=10s period=10s #success=1 #failure=3
Environment:
PORT: 3000
CLUSTER_API_URL: https://kubernetes.default.svc:443
Mounts:
/app/certs from console-mce-console-certs (rw)
/app/config from console-mce-console-mce-config (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wk6q5 (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
console-mce-console-certs:
Type: Secret (a volume populated by a Secret)
SecretName: console-mce-console-certs
Optional: false
console-mce-console-mce-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: console-mce-config
Optional: false
kube-api-access-wk6q5:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
Optional: false
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node-role.kubernetes.io/infra:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 66m kubelet Liveness probe failed: Get "https://[fd01:0:0:3::31]:3000/livenessProbe": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 64m (x2 over 80m) kubelet Readiness probe failed: Get "https://[fd01:0:0:3::31]:3000/readinessProbe": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 63m kubelet Readiness probe failed: Get "https://[fd01:0:0:3::31]:3000/readinessProbe": read tcp [fd01:0:0:3::2]:51308->[fd01:0:0:3::31]:3000: read: connection reset by peer
Warning Unhealthy 61m kubelet Readiness probe failed: Get "https://[fd01:0:0:3::31]:3000/readinessProbe": dial tcp [fd01:0:0:3::31]:3000: connect: connection refused
Warning Unhealthy 45m (x19 over 77m) kubelet Readiness probe failed: Get "https://[fd01:0:0:3::31]:3000/readinessProbe": context deadline exceeded
Normal Killing 26m (x7 over 64m) kubelet Container console failed liveness probe, will be restarted
Normal Pulled 26m (x7 over 63m) kubelet Container image "registry.redhat.io/multicluster-engine/console-mce-rhel9@sha256:519c4d77a3a0c1bc85a7d7ef5218544e447ca8fd3708e03faca8eb183667f534" already present on machine
Normal Created 26m (x7 over 63m) kubelet Created container: console
Normal Started 26m (x7 over 63m) kubelet Started container console
Warning Unhealthy 15m (x32 over 64m) kubelet Liveness probe failed: Get "https://[fd01:0:0:3::31]:3000/livenessProbe": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 5m8s (x42 over 66m) kubelet Readiness probe failed: Get "https://[fd01:0:0:3::31]:3000/readinessProbe": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 9s (x38 over 77m) kubelet Liveness probe failed: Get "https://[fd01:0:0:3::31]:3000/livenessProbe": context deadline exceeded
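Given the 10s probe timeout and the very small CPU request (3m) shown above, the container may simply be too slow to answer the probes under this load. The following is a diagnostic and mitigation sketch only, not something verified in this test; the container index in the patch is an assumption, and the MCE operator may reconcile a manual probe change back:
# Check whether the console container is pegged against its 3m CPU request
# (pod name taken from the listing above).
oc adm top pod console-mce-console-5579749956-259sh -n multicluster-engine --containers
# Hypothetical mitigation: relax the liveness probe timeout on the deployment
# (assumes the console container is at index 0; MCE may revert this on reconcile).
oc -n multicluster-engine patch deployment console-mce-console --type=json \
  -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 30}]'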
Version-Release number of selected component (if applicable):
OCP - 4.20.2
Deployed OCP - 4.20.2
ACM - 2.15.0-DOWNSTREAM-2025-10-29-01-15-32
AAP - aap-operator.v2.6.0-0.1762261209
How reproducible:
This occurred in all scale tests with 3500+ managed clusters with both AAP and MCGH installed and configured.
Steps to Reproduce:
- ...