Bug
Resolution: Unresolved
Critical
4.16
Description of problem:
- An incorrect AlertmanagerConfig causes the monitoring cluster operator to become degraded, reporting the following error:
  UpdatingUserWorkloadAlertmanager: waiting for Alertmanager User Workload object changes failed: waiting for Alertmanager openshift-user-workload-monitoring/user-workload: context deadline exceeded: expected 1 replicas, got 0 updated replicas
- On OCP 4.16.50, a single user-defined AlertmanagerConfig that references a non-existent Kubernetes Secret causes the Prometheus Operator to fail configuration generation. This blocks reconciliation of all other valid AlertmanagerConfigs in the user-workload monitoring stack (openshift-user-workload-monitoring). The issue is persistent and high-impact.
Version-Release number of selected component (if applicable):
OpenShift cluster version: 4.16.50
This specific failure mode (**missing Secret**) persists even with the backported fix and in newer upstream Prometheus Operator versions (tested up to **0.85.0** upstream [here](https://gss--c.vf.force.com/apex/support#/cases/04323140/caseupdates/a0aHn00000a2B3ZIAU), linked to GitHub Issue **#7264** [here](https://gss--c.vf.force.com/apex/support#/cases/04323140/caseupdates/a0aHn00000a2AI4IAM)). The operator's Secret retrieval failure halts the entire configuration generation process, regardless of the URL validation fix.
How reproducible:
Always
YAML to use:
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: example
  namespace: noodles
spec:
  route:
    groupBy:
    - namespace
    receiver: msteams
  receivers:
  - name: msteams
    msteamsConfigs:
    - webhookUrl:
        key: url # key inside the Secret
        name: my-workflow-webhook # k8s Secret name in the same namespace as the AlertmanagerConfig
      sendResolved: true
      title: "mytitle"
      text: "mytext"
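For reference, the webhookUrl selector in the AlertmanagerConfig resolves against a Secret of roughly the following shape. This is only a sketch: the URL value is a placeholder assumption and is not taken from the affected cluster. The bug reproduces when this Secret is absent from the noodles namespace.

```yaml
# Hypothetical Secret that the AlertmanagerConfig above expects to find.
# Leave it uncreated to reproduce the failure.
apiVersion: v1
kind: Secret
metadata:
  name: my-workflow-webhook   # must match webhookUrl.name
  namespace: noodles          # same namespace as the AlertmanagerConfig
type: Opaque
stringData:
  url: https://example.invalid/webhook   # placeholder value; key must match webhookUrl.key
```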
Steps to Reproduce:
- Enable User Workload Monitoring
- Enable custom AlertmanagerConfigs
- Create a custom AlertmanagerConfig that references a Secret which does not exist
- Check the Prometheus Operator pod logs in the User Workload Monitoring namespace to verify that a similar error appears
- Attempt to upgrade
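The first two steps are normally done through the monitoring ConfigMaps. The fragments below are a minimal sketch based on the documented OpenShift monitoring configuration, assuming the dedicated user-workload Alertmanager is in use (as the openshift-user-workload-monitoring/user-workload object in the error suggests). They are not copied from the affected cluster, so check the field names against the documentation for your version.

```yaml
# Step 1 (assumed): enable User Workload Monitoring.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
---
# Step 2 (assumed): enable the user-workload Alertmanager and let it
# pick up user-defined AlertmanagerConfig objects.
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true
```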
Actual results:
$ oc logs prometheus-operator-85fcd7664-xfstm | grep webhook | tail -n1
ts=2025-12-10T07:26:57.027512738Z level=error caller=/go/src/github.com/coreos/prometheus-operator/pkg/operator/resource_reconciler.go:557 msg="Unhandled Error" logger=UnhandledError err="sync \"openshift-user-workload-monitoring/user-workload\" failed: provision alertmanager configuration: failed to generate Alertmanager configuration: AlertmanagerConfig noodles/example: MSTeamsConfig[0]: unable to get secret \"my-workflow-webhook\": secrets \"my-workflow-webhook\" not found"

Then during upgrade:

$ oc get co
...
monitoring 4.19.19 False True True 2m UpdatingUserWorkloadAlertmanager: waiting for Alertmanager User Workload object changes failed: waiting for Alertmanager openshift-user-workload-monitoring/user-workload: context deadline exceeded: expected 1 replicas, got 0 updated replicas
...
Expected results:
The monitoring operator should remain Available, and a non-existent Secret should not block reconciliation of the other, valid AlertmanagerConfigs.
Additional info:
- blocks: OCPBUGS-74477 [release-4.21] Incorrect AlertManagerConfig causing Monitoring operator degraded (Closed)
- is cloned by: OCPBUGS-74477 [release-4.21] Incorrect AlertManagerConfig causing Monitoring operator degraded (Closed)
- relates to: OCPBUGS-58408 Backport of Prometheus Operator Bug Fix: "One Alertmanager Config failing blocks all others" to OCP 4.17 (Closed)
- links to