
Type: Bug
Resolution: Unresolved
Priority: Critical
Fix Version/s: 4.22.0
Affects Version/s: 4.16
Component/s: Monitoring
Labels:
- monitoring

Activity Type:
None
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
Critical
Regression:
None

Target Backport Versions:

4.19.z, 4.20.z, 4.21.z
Target Version:

4.22.0
Release Blocker:
None
Sprint:
MON Sprint 282, MON Sprint 283
sprint_count:
2

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
In Progress
Release Note Type:
Bug Fix
Release Note Text:

Before this update, the user workload Prometheus Operator did not validate the webhookURL secret reference in the MSTeams receiver configuration of the AlertmanagerConfig custom resource. As a consequence, an invalid or missing webhookURL secret could be accepted, causing the user workload Alertmanager to crash at runtime. With this update, the user workload Prometheus Operator validates the webhookURL secret for MSTeams receivers, rejecting invalid configurations before they can affect Alertmanager.

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

- An incorrect AlertmanagerConfig causes the monitoring cluster operator to become degraded with the following error:
UpdatingUserWorkloadAlertmanager: waiting for Alertmanager User Workload object changes failed: waiting for Alertmanager openshift-user-workload-monitoring/user-workload: context deadline exceeded: expected 1 replicas, got 0 updated replicas

- On OCP 4.16.50, a single user-defined AlertmanagerConfig that references a non-existent Kubernetes Secret causes the Prometheus Operator to fail configuration generation. This blocks the reconciliation of all other valid AlertmanagerConfigs in the user-workload monitoring stack (openshift-user-workload-monitoring). The issue is persistent and high-impact.

Version-Release number of selected component (if applicable):

OpenShift cluster version: 4.16.50

This specific failure mode (**missing Secret**) persists even with the backported fix and in newer upstream Prometheus Operator versions (tested up to **0.85.0** upstream [here](https://gss--c.vf.force.com/apex/support#/cases/04323140/caseupdates/a0aHn00000a2B3ZIAU), linked to GitHub Issue **#7264** [here](https://gss--c.vf.force.com/apex/support#/cases/04323140/caseupdates/a0aHn00000a2AI4IAM)). The operator's Secret retrieval failure halts the entire configuration generation process, regardless of the URL validation fix.

How reproducible:

Always

YAML to use:

apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: example
  namespace: noodles
spec:
  route:
    groupBy:
    - namespace
    receiver: msteams
  receivers:
  - name: msteams
    msteamsConfigs:
    - webhookUrl:
        key: url # key within the Secret that holds the webhook URL
        name: my-workflow-webhook # k8s Secret name in the same namespace as the AlertmanagerConfig
      sendResolved: true
      title: "mytitle"
      text: "mytext"

Steps to Reproduce:

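- Enable user workload monitoring.
- Enable custom Alertmanager configurations for user workload monitoring.
- Create the AlertmanagerConfig above without creating the Secret it references (my-workflow-webhook).
- Check the Prometheus Operator pod logs in the user workload monitoring namespace (openshift-user-workload-monitoring) for an error similar to the one shown under "Actual results".
- Attempt to upgrade the cluster.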

Actual results:

$ oc logs prometheus-operator-85fcd7664-xfstm | grep webhook | tail -n1
ts=2025-12-10T07:26:57.027512738Z level=error caller=/go/src/github.com/coreos/prometheus-operator/pkg/operator/resource_reconciler.go:557 msg="Unhandled Error" logger=UnhandledError err="sync \"openshift-user-workload-monitoring/user-workload\" failed: provision alertmanager configuration: failed to generate Alertmanager configuration: AlertmanagerConfig noodles/example: MSTeamsConfig[0]: unable to get secret \"my-workflow-webhook\": secrets \"my-workflow-webhook\" not found"

Then during upgrade:
$ oc get co
...
monitoring                                 4.19.19   False       True          True       2m      UpdatingUserWorkloadAlertmanager: waiting for Alertmanager User Workload object changes failed: waiting for Alertmanager openshift-user-workload-monitoring/user-workload: context deadline exceeded: expected 1 replicas, got 0 updated replicas
...

Expected results:

The monitoring operator should remain in the Available status, and a non-existent Secret referenced by one AlertmanagerConfig should not block AlertmanagerConfig reconciliation.
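
For illustration only, the expected behavior can be sketched in Go as "validate each AlertmanagerConfig on its own and skip the ones that fail, instead of aborting the whole generation". This is not the Prometheus Operator's actual code. The types `secretRef`, `msteamsReceiver`, `alertmanagerConfig`, and `secretStore`, the functions `validate` and `selectValid`, and the namespace `pasta` are hypothetical simplifications introduced here.

```go
package main

import (
	"fmt"
	"net/url"
)

// secretRef is a simplified stand-in for a Kubernetes secret key selector.
type secretRef struct {
	Name string
	Key  string
}

// msteamsReceiver is a hypothetical, reduced model of one MSTeams receiver.
type msteamsReceiver struct {
	WebhookURL secretRef
}

// alertmanagerConfig is a hypothetical, reduced model of an AlertmanagerConfig.
type alertmanagerConfig struct {
	Namespace string
	Name      string
	MSTeams   []msteamsReceiver
}

// secretStore maps namespace -> secret name -> key -> value.
type secretStore map[string]map[string]map[string]string

// validate reports an error if any MSTeams webhook secret is missing,
// lacks the referenced key, or does not contain an absolute URL.
func validate(c alertmanagerConfig, secrets secretStore) error {
	for i, r := range c.MSTeams {
		data, ok := secrets[c.Namespace][r.WebhookURL.Name]
		if !ok {
			return fmt.Errorf("MSTeamsConfig[%d]: secret %q not found", i, r.WebhookURL.Name)
		}
		raw, ok := data[r.WebhookURL.Key]
		if !ok {
			return fmt.Errorf("MSTeamsConfig[%d]: key %q not found in secret %q", i, r.WebhookURL.Key, r.WebhookURL.Name)
		}
		u, err := url.Parse(raw)
		if err != nil || u.Scheme == "" || u.Host == "" {
			return fmt.Errorf("MSTeamsConfig[%d]: invalid webhook URL", i)
		}
	}
	return nil
}

// selectValid keeps the configs that pass validation and records the
// rejected ones by "namespace/name", so one bad config cannot block the rest.
func selectValid(configs []alertmanagerConfig, secrets secretStore) ([]alertmanagerConfig, map[string]error) {
	var valid []alertmanagerConfig
	rejected := map[string]error{}
	for _, c := range configs {
		if err := validate(c, secrets); err != nil {
			rejected[c.Namespace+"/"+c.Name] = err
			continue
		}
		valid = append(valid, c)
	}
	return valid, rejected
}

func main() {
	// Only the hypothetical "pasta" namespace has its Secret; "noodles" does not.
	secrets := secretStore{
		"pasta": {"hook": {"url": "https://example.invalid/webhook"}},
	}
	configs := []alertmanagerConfig{
		{Namespace: "noodles", Name: "example", MSTeams: []msteamsReceiver{
			{WebhookURL: secretRef{Name: "my-workflow-webhook", Key: "url"}},
		}},
		{Namespace: "pasta", Name: "ok", MSTeams: []msteamsReceiver{
			{WebhookURL: secretRef{Name: "hook", Key: "url"}},
		}},
	}

	valid, rejected := selectValid(configs, secrets)
	for _, c := range valid {
		fmt.Printf("accepted %s/%s\n", c.Namespace, c.Name)
	}
	for name, err := range rejected {
		fmt.Printf("rejected %s: %v\n", name, err)
	}
}
```

In this sketch, noodles/example is rejected because its Secret is missing, while the unrelated config in the other namespace is still accepted.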

Additional info:

blocks

OCPBUGS-74477 [release-4.21] Incorrect AlertManagerConfig causing Monitoring operator degraded

Closed

is cloned by

OCPBUGS-74477 [release-4.21] Incorrect AlertManagerConfig causing Monitoring operator degraded

Closed

relates to

OCPBUGS-58408 Backport of Prometheus Operator Bug Fix: "One Alertmanager Config failing blocks all others" to OCP 4.17

Closed

links to

openshift/prometheus-operator#358: OCPBUGS-67303: Validate `webhookURL` secret for `MSTeams` receiver in `AlertmanagerConfig` CRD
