OpenShift Bugs: OCPBUGS-74477

[release-4.21] Incorrect AlertManagerConfig causing Monitoring operator degraded

    • Bug
    • Resolution: Done
    • Critical
    • 4.21.z
    • 4.16
    • Monitoring
    • False
    • Critical
    • MON Sprint 283, MON Sprint 284
    • 2
    • In Progress
    • Bug Fix
      [NOTE]: When the Security Level is "Red Hat Employee (Red Hat Employee and Contractors only)", the Release Note content is not included in the doc.

      Before this update, the user workload Prometheus Operator did not validate the webhookURL secret reference in the MSTeams receiver configuration of the AlertmanagerConfig custom resource. As a consequence, an invalid or missing webhookURL secret could be accepted, causing the user workload Alertmanager to crash at runtime. With this update, the user workload Prometheus Operator validates the webhookURL secret for MSTeams receivers, rejecting invalid configurations before they can affect Alertmanager.

      This is a clone of issue OCPBUGS-67303. The following is the description of the original issue:

      Description of problem:

      - An incorrect AlertmanagerConfig causes the Monitoring operator to become degraded with the following error:
      UpdatingUserWorkloadAlertmanager: waiting for Alertmanager User Workload object changes failed: waiting for Alertmanager openshift-user-workload-monitoring/user-workload: context deadline exceeded: expected 1 replicas, got 0 updated replicas
      
      - On OCP 4.16.50, a single user-defined AlertmanagerConfig that references a non-existent Kubernetes Secret causes the Prometheus Operator to fail configuration generation. This blocks the reconciliation of all other valid AlertmanagerConfigs in the user-workload monitoring stack (openshift-user-workload-monitoring). The issue is persistent and high-impact.

       

      Version-Release number of selected component (if applicable):

          OpenShift cluster version: 4.16.50
         
      
      This specific failure mode (**missing Secret**) persists even with the backported fix and in newer upstream Prometheus Operator versions (tested up to upstream **0.85.0**, see [here](https://gss--c.vf.force.com/apex/support#/cases/04323140/caseupdates/a0aHn00000a2B3ZIAU); linked to GitHub Issue **#7264**, see [here](https://gss--c.vf.force.com/apex/support#/cases/04323140/caseupdates/a0aHn00000a2AI4IAM)). When the operator fails to retrieve the Secret, the entire configuration generation process halts, regardless of the URL validation fix.

      How reproducible:

      Always
      YAML to use:
      
      apiVersion: monitoring.coreos.com/v1beta1
      kind: AlertmanagerConfig
      metadata:
        name: example
        namespace: noodles
      spec:
        route:
          groupBy:
          - namespace
          receiver: msteams
        receivers:
        - name: msteams
          msteamsConfigs:
          - webhookUrl:
              key: url # key inside the secret that holds the webhook URL
              name: my-workflow-webhook # k8s secret name in the same namespace as the AlertmanagerConfig (does not exist in this reproducer)
            sendResolved: true
            title: "mytitle"
            text: "mytext" 

      Steps to Reproduce:

      - Enable User Workload Monitoring
      - Enable custom Alertmanager configs
      - Create a custom AlertmanagerConfig that references a non-existent Secret (see the YAML above)
      - Check the Prometheus Operator pod logs in the User Workload Monitoring namespace to verify that a similar error appears
      - Attempt to upgrade

      Actual results:

      $ oc logs prometheus-operator-85fcd7664-xfstm | grep webhook | tail -n1
      ts=2025-12-10T07:26:57.027512738Z level=error caller=/go/src/github.com/coreos/prometheus-operator/pkg/operator/resource_reconciler.go:557 msg="Unhandled Error" logger=UnhandledError err="sync \"openshift-user-workload-monitoring/user-workload\" failed: provision alertmanager configuration: failed to generate Alertmanager configuration: AlertmanagerConfig noodles/example: MSTeamsConfig[0]: unable to get secret \"my-workflow-webhook\": secrets \"my-workflow-webhook\" not found"
      
      Then during upgrade:
      $ oc get co
      ...
      monitoring                                 4.19.19   False       True          True       2m      UpdatingUserWorkloadAlertmanager: waiting for Alertmanager User Workload object changes failed: waiting for Alertmanager openshift-user-workload-monitoring/user-workload: context deadline exceeded: expected 1 replicas, got 0 updated replicas
      ...
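
The Prometheus Operator error quoted in the actual results names both the offending AlertmanagerConfig and the Secret it could not find. As a rough triage aid, the sketch below extracts those fields from such a log line. It is only an illustration: the regular expression is derived from the single log line in this report, the helper name `find_missing_secret` is hypothetical, and other operator versions may word the error differently.

```python
import re

# Pattern derived from the one log line quoted in this report (an assumption);
# the quotes around the secret name may or may not be backslash-escaped.
MISSING_SECRET = re.compile(
    r'AlertmanagerConfig (?P<namespace>[^/\s]+)/(?P<name>[^:\s]+): '
    r'(?P<receiver>\w+)\[(?P<index>\d+)\]: '
    r'unable to get secret \\?"(?P<secret>[^"\\]+)\\?"'
)


def find_missing_secret(log_line):
    """Return a dict describing the failing AlertmanagerConfig, or None."""
    match = MISSING_SECRET.search(log_line)
    if match is None:
        return None
    return match.groupdict()


# The err="..." part of the log line from this report, kept verbatim.
sample = (
    r'err="sync \"openshift-user-workload-monitoring/user-workload\" failed: '
    r'provision alertmanager configuration: failed to generate Alertmanager '
    r'configuration: AlertmanagerConfig noodles/example: MSTeamsConfig[0]: '
    r'unable to get secret \"my-workflow-webhook\": '
    r'secrets \"my-workflow-webhook\" not found"'
)

if __name__ == "__main__":
    info = find_missing_secret(sample)
    if info is not None:
        print(f"{info['namespace']}/{info['name']} references "
              f"missing secret {info['secret']}")
```

Feeding the operator log through a helper like this would point directly at the AlertmanagerConfig to fix or delete, which is the only workaround implied by the report.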

      Expected results:

      The monitoring operator should remain in the Available state, and a non-existent Secret referenced by one AlertmanagerConfig should not affect the reconciliation of the other AlertmanagerConfigs.
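
The expected behaviour can be pictured as per-resource validation: each AlertmanagerConfig is checked on its own, and one with an unresolvable Secret reference is rejected while the rest are still used. The sketch below models that idea in Python. It is not the Prometheus Operator's implementation (the operator is written in Go); the types `SecretRef` and `AlertmanagerConfig`, the function `select_valid_configs`, and the `pasta` namespace in the usage example are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecretRef:
    namespace: str
    name: str
    key: str


@dataclass(frozen=True)
class AlertmanagerConfig:
    namespace: str
    name: str
    webhook_secret: SecretRef


def select_valid_configs(configs, secrets):
    """Split configs into (accepted, rejected) by Secret availability.

    `secrets` maps (namespace, secret name) to a dict of key -> value.
    A rejected entry is a (config, reason) tuple; it never aborts the loop.
    """
    accepted, rejected = [], []
    for cfg in configs:
        ref = cfg.webhook_secret
        data = secrets.get((ref.namespace, ref.name))
        if data is None:
            rejected.append((cfg, f'secret "{ref.name}" not found'))
        elif not data.get(ref.key):
            rejected.append(
                (cfg, f'key "{ref.key}" missing or empty in secret "{ref.name}"')
            )
        else:
            accepted.append(cfg)
    return accepted, rejected


if __name__ == "__main__":
    broken = AlertmanagerConfig(
        "noodles", "example", SecretRef("noodles", "my-workflow-webhook", "url")
    )
    healthy = AlertmanagerConfig(
        "pasta", "ok", SecretRef("pasta", "teams-webhook", "url")
    )
    secrets = {("pasta", "teams-webhook"): {"url": "https://example.invalid/hook"}}
    accepted, rejected = select_valid_configs([broken, healthy], secrets)
    print([c.name for c in accepted], [(c.name, why) for c, why in rejected])
```

The design point is that a lookup failure is recorded against the single resource that caused it instead of being returned as an error for the whole configuration generation.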

      Additional info:

          

              janantha@redhat.com Jayapriya Pai
              rhn-support-slakade Shivam Lakade
              Junqi Zhao