Hybrid Cloud Console / RHCLOUD-44952

[notifications] high cardinality issue with user_provider_get_users metrics

    • Quality / Stability / Reliability

      The user_provider_get_users metrics include the org-id as a label. This drives up the cardinality of these metrics and causes issues for the Prometheus / Grafana servers. It triggered an app-interface/app-sre alert:

      https://redhat-internal.slack.com/archives/CCRND57FW/p1770395587714289

      The issue appears to have been triggered by a surge in messages from RBAC on 02/02/2026. Because the org-id is a label, each org seen during the surge produced its own unique set of user_provider_get_users_* metrics. The count continued to climb until it reached 50k, which triggered the alert.
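
To illustrate the mechanism described above, here is a minimal sketch in plain Java. It is not the notifications-backend code: the class `CardinalitySketch`, the method `seriesCount`, the label name `org_id`, and the sample org IDs are all hypothetical. It only assumes that a time series is identified by its metric name plus its label set, so every distinct org-id value adds one more series.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CardinalitySketch {

    // Hypothetical stand-in for a metrics registry: it keeps one entry per
    // unique (metric name, label set) pair, which is how a series is identified.
    static int seriesCount(List<String> orgIds, boolean labelWithOrgId) {
        Set<String> series = new HashSet<>();
        for (String orgId : orgIds) {
            // "org_id" is an assumed label name, used here for illustration only.
            String labels = labelWithOrgId ? "{org_id=\"" + orgId + "\"}" : "";
            series.add("user_provider_get_users" + labels);
        }
        return series.size();
    }

    public static void main(String[] args) {
        // Four requests from three distinct (made-up) orgs.
        List<String> orgIds = List.of("org-1", "org-2", "org-3", "org-1");
        System.out.println("with org label:    " + seriesCount(orgIds, true));  // 3
        System.out.println("without org label: " + seriesCount(orgIds, false)); // 1
    }
}
```

With the org-id as a label, the series count tracks the number of distinct orgs seen since the last restart. Without it, the count stays constant no matter how many orgs send traffic.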

      Had we redeployed this week, the metrics would have reset, the alert likely would not have fired, and we would have missed this issue.

      This looks like where the metric is used: https://github.com/RedHatInsights/notifications-backend/blob/master/recipients-resolver/src/main/java/com/redhat/cloud/notifications/recipients/resolver/FetchUsersFromExternalServices.java#L185

      Once the cardinality issue is fixed, revert the change that dropped the problematic metrics:
      https://gitlab.cee.redhat.com/service/app-interface/-/blob/master/resources/insights-prod/notifications-prod/service-monitor/notifications-recipients-resolver.servicemonitor.yml?ref_type=heads#L13-16

              rh-ee-gduval Guillaume Duval
              rhn-support-dehort Derek Horton