Type: Bug
Resolution: Unresolved
Priority: Major
Affects Version/s: ACM 2.12.0, ACM 2.12.1
Sprint: Observability Sprint 33, Observability Sprint 34
Severity: Moderate
Description of problem:
For some currently unknown reason, during the upgrade from ACM 2.11 to ACM 2.12 we use the proxy image from the base templates. This seems to happen only for a short while; on later reconciles we appear to get the correct image from the OCP imagestream.
In disconnected environments this causes problems because that image is not available. For rbac-query-proxy and Grafana the correct image is eventually used, but alertmanager uses a statefulset and gets stuck in the bad state due to the known Kubernetes issue described here: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#forced-rollback . As a result, the alertmanager-2 pod is left in a crashloop and the other two pods (alertmanager-1 and alertmanager-0) keep using the 2.11 image. Scaling the statefulset down to n-1 resolves the problem (the statefulset is then automatically scaled back up to the desired count, 3 by default).
Version-Release number of selected component (if applicable):
ACM 2.12
How reproducible:
Always
Steps to Reproduce:
- Install ACM 2.11 in a disconnected environment and enable observability
- Upgrade to ACM 2.12
Alternatively: don't install in a disconnected environment, but keep a close eye on the alertmanager-2 pod during the upgrade and observe that it tries to pull the quay.io/stolostron/origin-oauth-proxy image for the proxy container
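A minimal sketch of how the bad image can be spotted during the upgrade window. The `oc` query is shown as a comment (it needs a live cluster); the pod name `observability-alertmanager-2` and the namespace are assumed from the statefulset name in this report. The hard-coded `image` value below is an example of what the bug produces, not output captured from a cluster:

```shell
# On a live cluster the proxy container image would be read with something like:
#   image=$(oc get pod observability-alertmanager-2 \
#     -n open-cluster-management-observability \
#     -o jsonpath='{.spec.containers[*].image}')
# Example value matching the buggy behavior described above:
image="quay.io/stolostron/origin-oauth-proxy:latest"

# A quay.io/stolostron/origin-oauth-proxy reference means the base-template
# image was rendered; anything else suggests the image was resolved normally.
case "$image" in
  quay.io/stolostron/origin-oauth-proxy*) verdict="base-template image (bug present)" ;;
  *) verdict="expected image" ;;
esac
echo "$verdict"
```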
Actual results:
- The alertmanager-2 pod is in a crashloop and the other pods continue to use the old images
Expected results:
- All alertmanager replicas are healthy using the correct 2.12 images.
Workaround:
oc scale statefulset observability-alertmanager -n open-cluster-management-observability --replicas=2
note: set replicas to n-1 if the number of replicas for the statefulset has been changed from the default (default: 3)
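The workaround above can be sketched as a small script that computes n-1 from the current replica count instead of hard-coding 2. The `oc` calls are shown as comments since they need a live cluster; the fallback value of 3 is the default replica count stated in this report:

```shell
# Read the current replica count from the statefulset; hard-coded fallback of 3
# (the default) so this sketch runs standalone:
#   replicas=$(oc get statefulset observability-alertmanager \
#     -n open-cluster-management-observability \
#     -o jsonpath='{.spec.replicas}')
replicas=3
target=$((replicas - 1))
echo "scaling observability-alertmanager from $replicas to $target replicas"
# Scaling down past the stuck pod triggers the forced rollback; the operator
# then scales the statefulset back up to the desired count on its own:
#   oc scale statefulset observability-alertmanager \
#     -n open-cluster-management-observability --replicas="$target"
```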