Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: 4.21.z
Affects Version/s: 4.16
Component/s: Monitoring
Labels:
- monitoring

Activity Type: None
Blocked: False
Blocked Reason: None
Story Points: None
Severity: Critical
Regression: None

Target Backport Versions: None
Target Version: 4.21.z
Release Blocker: None
Sprint: MON Sprint 283, MON Sprint 284
sprint_count: 2

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status: In Progress
Release Note Type: Bug Fix
Release Note Text:

[NOTE]: When the Security Level is "Red Hat Employee (Red Hat Employee and Contractors only)", the Release Note content is not included in the doc.

Before this update, the user workload Prometheus Operator did not validate the webhookURL secret reference in the MSTeams receiver configuration of the AlertmanagerConfig custom resource. As a consequence, an invalid or missing webhookURL secret could be accepted, causing the user workload Alertmanager to crash at runtime. With this update, the user workload Prometheus Operator validates the webhookURL secret for MSTeams receivers, rejecting invalid configurations before they can affect Alertmanager.


Escape Reason: None
Escape Impact: None
Corrective Measures: None
SDLC stage when should've been found: None

This is a clone of issue OCPBUGS-67303. The following is the description of the original issue:
—
Description of problem:

- An incorrect AlertmanagerConfig causes the monitoring cluster operator to become degraded with the following error:
UpdatingUserWorkloadAlertmanager: waiting for Alertmanager User Workload object changes failed: waiting for Alertmanager openshift-user-workload-monitoring/user-workload: context deadline exceeded: expected 1 replicas, got 0 updated replicas

- On OCP 4.16.50, a single user-defined AlertmanagerConfig that references a non-existent Kubernetes Secret causes the Prometheus Operator to fail configuration generation. This blocks reconciliation of all other valid AlertmanagerConfigs in the user-workload monitoring stack (openshift-user-workload-monitoring). The issue is persistent and high-impact.

Version-Release number of selected component (if applicable):

OpenShift cluster version: 4.16.50

This specific failure mode (**missing Secret**) persists even with the backported fix and in newer upstream Prometheus Operator versions (tested up to **0.85.0** upstream [here](https://gss--c.vf.force.com/apex/support#/cases/04323140/caseupdates/a0aHn00000a2B3ZIAU), linked to GitHub Issue **#7264** [here](https://gss--c.vf.force.com/apex/support#/cases/04323140/caseupdates/a0aHn00000a2AI4IAM)). The operator's Secret retrieval failure halts the entire configuration generation process, regardless of the URL validation fix.

How reproducible:

Always

YAML to use:

apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: example
  namespace: noodles
spec:
  route:
    groupBy:
    - namespace
    receiver: msteams
  receivers:
  - name: msteams
    msteamsConfigs:
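    - webhookUrl:
        key: url # key inside the Secret that holds the webhook URL
        name: my-workflow-webhook # Kubernetes Secret name in the same namespace as the AlertmanagerConfig; the Secret must not exist to reproduce the failure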
      sendResolved: true
      title: "mytitle"
      text: "mytext"

Steps to Reproduce:

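- Enable User Workload Monitoring
- Enable custom Alertmanager configurations (AlertmanagerConfig) for user workloads
- Create a custom AlertmanagerConfig from the YAML above, referencing a Secret that does not exist
- Check the Prometheus Operator pod logs in the User Workload Monitoring namespace and verify that a similar error appears
- Attempt to upgrade the cluster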

Actual results:

$ oc logs prometheus-operator-85fcd7664-xfstm | grep webhook | tail -n1
ts=2025-12-10T07:26:57.027512738Z level=error caller=/go/src/github.com/coreos/prometheus-operator/pkg/operator/resource_reconciler.go:557 msg="Unhandled Error" logger=UnhandledError err="sync \"openshift-user-workload-monitoring/user-workload\" failed: provision alertmanager configuration: failed to generate Alertmanager configuration: AlertmanagerConfig noodles/example: MSTeamsConfig[0]: unable to get secret \"my-workflow-webhook\": secrets \"my-workflow-webhook\" not found"

Then during upgrade:
$ oc get co
...
monitoring                                 4.19.19   False       True          True       2m      UpdatingUserWorkloadAlertmanager: waiting for Alertmanager User Workload object changes failed: waiting for Alertmanager openshift-user-workload-monitoring/user-workload: context deadline exceeded: expected 1 replicas, got 0 updated replicas
...

Expected results:

The monitoring operator should remain in the Available status, and a non-existent Secret should not affect the reconciliation of other AlertmanagerConfig resources.
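
The expected behavior can be illustrated with a minimal Go sketch. This is not the Prometheus Operator's code: the types `SecretRef`, `Config`, and `SecretStore` and the functions `resolveWebhook` and `selectValid` are hypothetical names introduced only for illustration, and the in-memory map stands in for the Kubernetes API. The sketch assumes that each AlertmanagerConfig is checked on its own, so one with an unresolvable webhook Secret is rejected while the rest are still accepted.

```go
package main

import "fmt"

// SecretRef is a hypothetical stand-in for a Kubernetes secret key selector.
type SecretRef struct {
	Name string
	Key  string
}

// Config is a hypothetical, reduced view of an AlertmanagerConfig
// that carries a single MSTeams receiver.
type Config struct {
	Namespace  string
	Name       string
	WebhookURL SecretRef
}

// SecretStore maps "namespace/name" to the key/value data of a Secret.
// It replaces the Kubernetes API for the purpose of this sketch.
type SecretStore map[string]map[string]string

// resolveWebhook returns the webhook URL referenced by c, or an error
// when the Secret or the key is missing or the value is empty.
func resolveWebhook(store SecretStore, c Config) (string, error) {
	data, ok := store[c.Namespace+"/"+c.WebhookURL.Name]
	if !ok {
		return "", fmt.Errorf("AlertmanagerConfig %s/%s: secret %q not found",
			c.Namespace, c.Name, c.WebhookURL.Name)
	}
	url, ok := data[c.WebhookURL.Key]
	if !ok || url == "" {
		return "", fmt.Errorf("AlertmanagerConfig %s/%s: key %q missing or empty in secret %q",
			c.Namespace, c.Name, c.WebhookURL.Key, c.WebhookURL.Name)
	}
	return url, nil
}

// selectValid validates every config independently. A failing config is
// recorded in rejected and skipped; it does not abort the whole run.
func selectValid(store SecretStore, configs []Config) (accepted []Config, rejected []error) {
	for _, c := range configs {
		if _, err := resolveWebhook(store, c); err != nil {
			rejected = append(rejected, err)
			continue
		}
		accepted = append(accepted, c)
	}
	return accepted, rejected
}
```

Calling `selectValid` with the `noodles/example` config from this report (whose Secret is absent) plus one config whose Secret exists would return only the second config as accepted and one error describing the missing Secret.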

Additional info:

Issue Links:

blocks: OCPBUGS-77190 [release-4.20] Incorrect AlertManagerConfig causing Monitoring operator degraded (Verified)
clones: OCPBUGS-67303 Incorrect AlertManagerConfig causing Monitoring operator degraded (Verified)
is blocked by: OCPBUGS-67303 Incorrect AlertManagerConfig causing Monitoring operator degraded (Verified)
is cloned by: OCPBUGS-77190 [release-4.20] Incorrect AlertManagerConfig causing Monitoring operator degraded (Verified)
links to: openshift/prometheus-operator#359: [release-4.21] OCPBUGS-74477: Validate `webhookURL` secret for `MSTeams` receiver in `AlertmanagerConfig` CRD
