OCPBUGS-56158: Excessive API calls by prometheus-operator ServiceAccount


    • Bug
    • Resolution: Unresolved
    • Major
    • 4.20.0
    • 4.16.z
    • Monitoring
    • None
    • Quality / Stability / Reliability
    • False
    • None
    • None
    • Important
    • None
    • None
    • MON Sprint 270, MON Sprint 271, MON Sprint 272, MON Sprint 274, MON Sprint 275, MON Sprint 276, MON Sprint 277
    • 7
    • Done
    • Bug Fix
    • Fix unnecessary API calls during secret updates::
      Before this update, when a secret was created or updated in any namespace, the Prometheus Operator reconciled the Alertmanager configuration even if that secret was not referenced in an `AlertmanagerConfig` resource. As a consequence, the Prometheus Operator generated excessive API calls, causing increased CPU usage on control plane nodes. With this release, the Prometheus Operator only reconciles secrets that an `AlertmanagerConfig` resource explicitly references.
      +
      link:https://issues.redhat.com/browse/OCPBUGS-56158[OCPBUGS-56158]
    • None
    • None
    • None
    • None

      Description of problem

      In the cluster audit logs, the customer is observing that over a 24h window, there are over 3.6 million GET requests for a certain Secret from "system:serviceaccount:openshift-user-workload-monitoring:prometheus-operator". This averages out to roughly 1,250 GET requests every 30 seconds (more than 40 per second).
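
      As a rough sketch of how such figures can be derived from the kube-apiserver audit logs (the node name and output file below are placeholders; the audit events are JSON lines with the standard `verb`, `user.username`, and `objectRef` fields):

      # Pull the kube-apiserver audit log from one control plane node
      oc adm node-logs <master-node> --path=kube-apiserver/audit.log > audit.log

      # Count GET requests on Secrets made by the UWM prometheus-operator ServiceAccount
      jq -r 'select(.verb == "get"
                    and .user.username == "system:serviceaccount:openshift-user-workload-monitoring:prometheus-operator"
                    and .objectRef.resource == "secrets")
             | .objectRef.namespace + "/" + .objectRef.name' audit.log \
        | sort | uniq -c | sort -rn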

      In parallel, we can see regular "sync alertmanager" messages in the operator logs roughly every 30 to 45 seconds, but we are not sure whether this is related:

      level=info ts=2025-05-12T12:49:21.739161413Z caller=operator.go:572 component=alertmanager-controller key=openshift-user-workload-monitoring/user-workload msg="sync alertmanager"
      level=info ts=2025-05-12T12:49:25.785042318Z caller=operator.go:471 component=thanos-controller key=openshift-user-workload-monitoring/user-workload msg="sync thanos-ruler"
      level=info ts=2025-05-12T12:50:04.752668355Z caller=operator.go:572 component=alertmanager-controller key=openshift-user-workload-monitoring/user-workload msg="sync alertmanager"
      level=info ts=2025-05-12T12:50:49.552780038Z caller=operator.go:572 component=alertmanager-controller key=openshift-user-workload-monitoring/user-workload msg="sync alertmanager"
      level=info ts=2025-05-12T12:51:23.461977084Z caller=operator.go:572 component=alertmanager-controller key=openshift-user-workload-monitoring/user-workload msg="sync alertmanager"
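
      These messages come from the prometheus-operator deployment in the openshift-user-workload-monitoring namespace and can be checked with, for example:

      oc -n openshift-user-workload-monitoring logs deployment/prometheus-operator | grep "sync alertmanager"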

      Describe the impact to you or the business

      The User Workload Monitoring Prometheus Operator places significant load on the Kubernetes API server, requiring the customer to assign more CPU than expected to the control plane (master) nodes.

      Version-Release number of selected component (if applicable)

      OCP 4.16.36

      How reproducible

      Constant on the customer cluster

      Steps to Reproduce

      1. Enable User Workload Monitoring
      2. Enable user-defined alert routing and create an `AlertmanagerConfig` resource that references a Secret (see the example manifests after this list)
      3. Observe the API audit logs
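
      For illustration, a minimal set of manifests matching these steps, assuming the standard OpenShift configuration keys for User Workload Monitoring and user-defined alert routing; the namespace, receiver, and Secret names are hypothetical:

      # Step 1: enable User Workload Monitoring
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: cluster-monitoring-config
        namespace: openshift-monitoring
      data:
        config.yaml: |
          enableUserWorkload: true
      ---
      # Step 2a: enable user-defined AlertmanagerConfig support
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: user-workload-monitoring-config
        namespace: openshift-user-workload-monitoring
      data:
        config.yaml: |
          alertmanager:
            enabled: true
            enableAlertmanagerConfig: true
      ---
      # Step 2b: AlertmanagerConfig that references a Secret (names are hypothetical)
      apiVersion: monitoring.coreos.com/v1beta1
      kind: AlertmanagerConfig
      metadata:
        name: example-routing
        namespace: my-app
      spec:
        route:
          receiver: slack
        receivers:
        - name: slack
          slackConfigs:
          - channel: '#alerts'
            apiURL:
              name: slack-api-url   # Secret in the same namespace holding the webhook URL
              key: url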

      Actual results

      Observe that for each `AlertmanagerConfig` there are a large number of API GET requests for the referenced Secrets

      Expected results

      Only a limited number of GET requests are made to the Kubernetes API

      Additional info

      • Logs and inspect files are available in the attached Support Case

              janantha@redhat.com Jayapriya Pai
              rhn-support-skrenger Simon Krenger
              None
              None
              Junqi Zhao
              Eliska Romanova
              Votes: 0
              Watchers: 10