  1. Red Hat Advanced Cluster Management
  2. ACM-15525

On upgrades, oauth images transiently use the template image, causing a crashloop in disconnected environments


    • Bug
    • Resolution: Done
    • Major
    • ACM 2.12.2
    • ACM 2.12.0, ACM 2.12.1
    • Observability
    • 1
    • False
    • None
    • False
    • Observability Sprint 33, Observability Sprint 34
    • Moderate
    • None

      Description of problem:

      For a currently unknown reason, during the upgrade from ACM 2.11 to ACM 2.12 the proxy image from the base templates is used. This seems to happen only for a short while; on later reconciles the correct image appears to be picked up from the OCP imagestream.

      In disconnected systems, this causes problems because the template image is not available. For rbac-query-proxy and Grafana the problem eventually resolves itself once the correct image is used. Alertmanager, however, runs as a statefulset and gets stuck in the bad state due to the known Kubernetes issue described here: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback . This leaves the alertmanager-2 pod in crashloop while the other two pods (alertmanager-1 and alertmanager-0) keep using the 2.11 image versions. Scaling the statefulset down to n-1 replicas resolves the problem; the statefulset is then automatically scaled back up to the desired amount (3 by default).
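One way to see the stuck rollout described above is to compare each pod's revision with the revision the statefulset wants to roll out. The following is a minimal sketch, not a verified diagnostic: it assumes the pods are named after the observability-alertmanager statefulset from the workaround in this report (observability-alertmanager-0, -1, -2) and relies on the standard `controller-revision-hash` pod label and the statefulset's `status.updateRevision` field. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: list alertmanager pods that are not on the StatefulSet's update revision.
NAMESPACE="open-cluster-management-observability"
STATEFULSET="observability-alertmanager"

# Reads "<pod> <revision>" lines on stdin and prints each pod whose
# revision differs from the target revision passed as $1.
pods_behind() {
  target="$1"
  while read -r pod rev; do
    if [ -n "$pod" ] && [ "$rev" != "$target" ]; then
      echo "$pod"
    fi
  done
}

# Queries the cluster (requires a logged-in oc session) and prints the
# pods that have not yet been moved to the update revision.
list_stale_pods() {
  target="$(oc get statefulset "$STATEFULSET" -n "$NAMESPACE" \
    -o jsonpath='{.status.updateRevision}')" || return 1
  replicas="$(oc get statefulset "$STATEFULSET" -n "$NAMESPACE" \
    -o jsonpath='{.spec.replicas}')" || return 1
  i=0
  while [ "$i" -lt "$replicas" ]; do
    pod="${STATEFULSET}-${i}"
    rev="$(oc get pod "$pod" -n "$NAMESPACE" \
      -o jsonpath='{.metadata.labels.controller-revision-hash}')"
    echo "$pod $rev"
    i=$((i + 1))
  done | pods_behind "$target"
}

# Usage:
#   list_stale_pods
```

If pods keep showing up here long after the upgrade while the highest-ordinal pod is failing, that would be consistent with the forced-rollback situation linked above.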

      Version-Release number of selected component (if applicable):

      ACM 2.12

      How reproducible:

      Always

      Steps to Reproduce:

      1.  Install ACM 2.11 in a disconnected environment and enable observability
      2.  Upgrade to ACM 2.12
        Alternatively
      3. Install in a connected environment instead and, during the upgrade, keep a close eye on the alertmanager-2 pod to see that it tries to pull the quay.io/stolostron/origin-oauth-proxy image for the proxy container
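To catch the transient image during the upgrade, the container images of the alertmanager pods can be listed and checked against the template image named in step 3. This is a rough sketch under assumptions: the pod names are derived from the observability-alertmanager statefulset used in the workaround, three replicas (the default) are assumed, and the helper names are hypothetical. It lists every container rather than assuming the exact name of the proxy container.

```shell
#!/bin/sh
# Sketch: print the image of every container in the alertmanager pods and
# flag those still pointing at the template image from this report.
NAMESPACE="open-cluster-management-observability"
TEMPLATE_IMAGE="quay.io/stolostron/origin-oauth-proxy"

# Returns 0 if the image reference is the template image (any tag or digest).
is_template_image() {
  case "$1" in
    "$TEMPLATE_IMAGE"|"$TEMPLATE_IMAGE":*|"$TEMPLATE_IMAGE"@*) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints "<pod> <container> <image>" per container (requires a logged-in
# oc session); template images get a trailing marker.
check_alertmanager_images() {
  for ordinal in 0 1 2; do   # assumes the default of 3 replicas
    pod="observability-alertmanager-${ordinal}"
    oc get pod "$pod" -n "$NAMESPACE" \
      -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}' |
    while read -r container image; do
      if is_template_image "$image"; then
        echo "$pod $container $image  <-- template image"
      else
        echo "$pod $container $image"
      fi
    done
  done
}

# Usage:
#   check_alertmanager_images
```

Because the wrong image reportedly appears only briefly, running the check repeatedly during the upgrade (for example in a loop with a short sleep) is more likely to catch it than a single invocation.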

      Actual results:

      • Alertmanager-2 pod in crashloop and other pods continuing to use the old images

      Expected results:

      • All alertmanager replicas are healthy using the correct 2.12 images.

      Workaround:

      oc scale statefulset observability-alertmanager -n open-cluster-management-observability --replicas=2
      

      Note: if the number of replicas for the statefulset has been changed from the default (3), set --replicas to n-1, where n is the configured number of replicas.
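For clusters where the replica count is not the default, the n-1 value can be derived from the statefulset itself instead of being hard-coded. The sketch below assumes the statefulset name and namespace from the workaround command; `scale_down_target` and `apply_workaround` are hypothetical helper names, and the script has not been validated against an affected cluster.

```shell
#!/bin/sh
# Sketch: scale the alertmanager statefulset down to n-1 replicas,
# reading n from the statefulset's current spec.
NAMESPACE="open-cluster-management-observability"
STATEFULSET="observability-alertmanager"

# Prints n-1 for a replica count n; rejects anything that is not an
# integer >= 1.
scale_down_target() {
  n="$1"
  if ! [ "$n" -ge 1 ] 2>/dev/null; then
    echo "invalid replica count: $n" >&2
    return 1
  fi
  echo $((n - 1))
}

# Reads the configured replica count and scales to n-1 (requires a
# logged-in oc session).
apply_workaround() {
  current="$(oc get statefulset "$STATEFULSET" -n "$NAMESPACE" \
    -o jsonpath='{.spec.replicas}')" || return 1
  target="$(scale_down_target "$current")" || return 1
  oc scale statefulset "$STATEFULSET" -n "$NAMESPACE" --replicas="$target"
}

# Usage:
#   apply_workaround
```

As described in the report, the statefulset is expected to be scaled back up to the desired count automatically afterwards, so no explicit scale-up step is included.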

              rh-ee-jachanse Jacob Baungard Hansen
              rh-ee-jachanse Jacob Baungard Hansen
              Xiang Yin Xiang Yin