ACM-26374

console-mce-console readiness and liveness probe failures at scale during ACM ZTP with AAP and EDA



      Description of problem:

      While scale testing ACM by ZTP (Zero Touch Provisioning) of 3500+ SNOs, with AAP configured through EDA to monitor a Kafka event bus (provided by Multicluster Global Hub) and initiate a playbook for every successful CGU, the console-mce-console pods were crashlooping due to probe failures.
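
      For reference, the EDA side of this setup looks roughly like the rulebook below. This is a minimal sketch: the Kafka bootstrap host, topic name, event shape, and job template name are placeholders rather than the exact values used in this test.

      - name: React to CGU completion events from the global hub Kafka bus
        hosts: all
        sources:
          - ansible.eda.kafka:
              host: kafka-kafka-bootstrap.multicluster-global-hub.svc  # placeholder bootstrap service
              port: 9092
              topic: gh-event                                          # placeholder topic name
        rules:
          - name: Launch a playbook for every successful CGU
            condition: event.body.reason == "CguSuccess"               # hypothetical event field
            action:
              run_job_template:
                name: post-cgu-playbook                                # placeholder job template
                organization: Default

      With this pipeline running against 3500+ SNOs, the console-mce-console pods accumulated restarts: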

       

      NAMESPACE                                          NAME                                                              READY   STATUS             RESTARTS        AGE     IP                NODE               NOMINATED NODE   READINESS GATES
      ...
      multicluster-engine                                console-mce-console-5579749956-259sh                              1/1     Running            7 (26m ago)     4h59m   fd01:0:0:3::31    d16-h14-000-r650   <none>           <none>
      multicluster-engine                                console-mce-console-5579749956-kq4wz                              1/1     Running            12 (25m ago)    4h59m   fd01:0:0:1::72    d16-h10-000-r650   <none>           <none>
      ...

      In the oc describe output for one of these pods, we can see that the probe failures are the cause of the crashlooping behavior:

       

      ...
        console:
          Container ID:   cri-o://2c9ff2a52d6c6d8cc13e08b8942be76441093982e69a9d430d653a8a488ed68c
          Image:          registry.redhat.io/multicluster-engine/console-mce-rhel9@sha256:519c4d77a3a0c1bc85a7d7ef5218544e447ca8fd3708e03faca8eb183667f534
          Image ID:       registry.redhat.io/multicluster-engine/console-mce-rhel9@sha256:519c4d77a3a0c1bc85a7d7ef5218544e447ca8fd3708e03faca8eb183667f534
          Port:           3000/TCP
          Host Port:      0/TCP
          State:          Running
            Started:      Sat, 15 Nov 2025 00:50:02 +0000
          Last State:     Terminated
            Reason:       Error
            Exit Code:    137
            Started:      Sat, 15 Nov 2025 00:47:32 +0000
            Finished:     Sat, 15 Nov 2025 00:50:02 +0000
          Ready:          True
          Restart Count:  7
          Requests:
            cpu:      3m
            memory:   40Mi
          Liveness:   http-get https://:3000/livenessProbe delay=10s timeout=10s period=10s #success=1 #failure=3
          Readiness:  http-get https://:3000/readinessProbe delay=0s timeout=10s period=10s #success=1 #failure=3
          Environment:
            PORT:             3000
            CLUSTER_API_URL:  https://kubernetes.default.svc:443
          Mounts:
            /app/certs from console-mce-console-certs (rw)
            /app/config from console-mce-console-mce-config (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wk6q5 (ro)
      Conditions:
        Type                        Status
        PodReadyToStartContainers   True 
        Initialized                 True 
        Ready                       True 
        ContainersReady             True 
        PodScheduled                True 
      Volumes:
        console-mce-console-certs:
          Type:        Secret (a volume populated by a Secret)
          SecretName:  console-mce-console-certs
          Optional:    false
        console-mce-console-mce-config:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      console-mce-config
          Optional:  false
        kube-api-access-wk6q5:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3607
          ConfigMapName:           kube-root-ca.crt
          Optional:                false
          DownwardAPI:             true
          ConfigMapName:           openshift-service-ca.crt
          Optional:                false
      QoS Class:                   Burstable
      Node-Selectors:              <none>
      Tolerations:                 node-role.kubernetes.io/infra:NoSchedule op=Exists
                                   node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                   node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
      Events:
        Type     Reason     Age                  From     Message
        ----     ------     ----                 ----     -------
        Warning  Unhealthy  66m                  kubelet  Liveness probe failed: Get "https://[fd01:0:0:3::31]:3000/livenessProbe": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
        Warning  Unhealthy  64m (x2 over 80m)    kubelet  Readiness probe failed: Get "https://[fd01:0:0:3::31]:3000/readinessProbe": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
        Warning  Unhealthy  63m                  kubelet  Readiness probe failed: Get "https://[fd01:0:0:3::31]:3000/readinessProbe": read tcp [fd01:0:0:3::2]:51308->[fd01:0:0:3::31]:3000: read: connection reset by peer
        Warning  Unhealthy  61m                  kubelet  Readiness probe failed: Get "https://[fd01:0:0:3::31]:3000/readinessProbe": dial tcp [fd01:0:0:3::31]:3000: connect: connection refused
        Warning  Unhealthy  45m (x19 over 77m)   kubelet  Readiness probe failed: Get "https://[fd01:0:0:3::31]:3000/readinessProbe": context deadline exceeded
        Normal   Killing    26m (x7 over 64m)    kubelet  Container console failed liveness probe, will be restarted
        Normal   Pulled     26m (x7 over 63m)    kubelet  Container image "registry.redhat.io/multicluster-engine/console-mce-rhel9@sha256:519c4d77a3a0c1bc85a7d7ef5218544e447ca8fd3708e03faca8eb183667f534" already present on machine
        Normal   Created    26m (x7 over 63m)    kubelet  Created container: console
        Normal   Started    26m (x7 over 63m)    kubelet  Started container console
        Warning  Unhealthy  15m (x32 over 64m)   kubelet  Liveness probe failed: Get "https://[fd01:0:0:3::31]:3000/livenessProbe": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
        Warning  Unhealthy  5m8s (x42 over 66m)  kubelet  Readiness probe failed: Get "https://[fd01:0:0:3::31]:3000/readinessProbe": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
        Warning  Unhealthy  9s (x38 over 77m)    kubelet  Liveness probe failed: Get "https://[fd01:0:0:3::31]:3000/livenessProbe": context deadline exceeded
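
      A possible short-term mitigation, while the root cause is investigated, is to relax the probe timeout and failure threshold on the console-mce-console deployment along the lines of the strategic-merge patch below. The values are illustrative only, and since this deployment is managed by the MCE operator, a manual patch may be reconciled away; a real fix would likely need to change the operator defaults.

      # probe-patch.yaml - illustrative values, not a tested fix
      spec:
        template:
          spec:
            containers:
              - name: console                # container name from the oc describe output above
                livenessProbe:
                  timeoutSeconds: 30         # up from 10s
                  failureThreshold: 6        # up from 3
                readinessProbe:
                  timeoutSeconds: 30
                  failureThreshold: 6

      This could be applied with "oc -n multicluster-engine patch deployment console-mce-console --patch-file probe-patch.yaml".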

      Version-Release number of selected component (if applicable):

      OCP (hub) - 4.20.2

      Deployed OCP (SNOs) - 4.20.2

      ACM - 2.15.0-DOWNSTREAM-2025-10-29-01-15-32

      AAP - aap-operator.v2.6.0-0.1762261209

      How reproducible:

      This occurred in all scale tests with 3500+ managed clusters where AAP and MCGH were installed and configured.

      Steps to Reproduce:

      1. Install ACM, Multicluster Global Hub, and AAP on a hub cluster, with EDA configured to monitor the MCGH Kafka event bus and launch a playbook for every successful CGU.
      2. ZTP 3500+ SNOs.
      3. Watch the console-mce-console pods in the multicluster-engine namespace.

      Actual results:

      The console-mce-console pods repeatedly fail their liveness and readiness probes and are restarted by the kubelet, crashlooping under load.

      Expected results:

      The console-mce-console pods stay Ready and respond to probes within the configured timeouts, even with 3500+ managed clusters, AAP, and MCGH on the hub.

      Additional info:
