-
Bug
-
Resolution: Done
-
Normal
-
Quality / Stability / Reliability
The user_provider_get_users metrics include the org-id as a label. This drives up the cardinality of these metrics and causes issues for the Prometheus / Grafana servers. This triggered an app-interface/app-sre alert:
https://redhat-internal.slack.com/archives/CCRND57FW/p1770395587714289
The issue appears to have been triggered by a surge in messages from RBAC on 02/02/2026. Because org-id is a label, each org produced its own unique user_provider_get_users_* series. The count continued to climb until it reached 50k, which triggered the alert.
Had we redeployed this week, the metrics would have reset, the alert likely would not have fired, and we would have missed this issue.
This looks like where the metric is used: https://github.com/RedHatInsights/notifications-backend/blob/master/recipients-resolver/src/main/java/com/redhat/cloud/notifications/recipients/resolver/FetchUsersFromExternalServices.java#L185
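The cardinality growth described above can be illustrated with a small, self-contained simulation. This is only a sketch in plain Java, not the actual notifications-backend code or its metrics library: the class name `SeriesCardinalityDemo`, the helper `countSeries`, the label names `org_id` and `outcome`, and the generated org IDs are all hypothetical. It assumes that each distinct label value creates one time series, which is how Prometheus-style metrics behave.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SeriesCardinalityDemo {

    /**
     * Counts distinct time series for one metric: one series per unique
     * label value. Metric and label names are illustrative assumptions.
     */
    static int countSeries(String metricName, String labelName, List<String> labelValues) {
        Set<String> series = new HashSet<>();
        for (String value : labelValues) {
            series.add(metricName + "{" + labelName + "=\"" + value + "\"}");
        }
        return series.size();
    }

    public static void main(String[] args) {
        int requests = 50_000; // same order of magnitude as the alert threshold in this ticket

        List<String> orgIds = new ArrayList<>();
        List<String> outcomes = new ArrayList<>();
        for (int i = 0; i < requests; i++) {
            orgIds.add("org-" + i);                            // unbounded label: one value per org
            outcomes.add(i % 2 == 0 ? "success" : "failure");  // bounded label: two values
        }

        // Labelling by org ID yields one series per org seen since the last restart.
        System.out.println("series with org_id label:  "
                + countSeries("user_provider_get_users", "org_id", orgIds));

        // Labelling by a small fixed set of values keeps the series count constant.
        System.out.println("series with outcome label: "
                + countSeries("user_provider_get_users", "outcome", outcomes));
    }
}
```

The series count for the unbounded label grows with the number of orgs seen, and it only resets when the process restarts, which is consistent with the observation that a redeploy would have hidden the problem.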
Revert the change that dropped the problematic metrics:
https://gitlab.cee.redhat.com/service/app-interface/-/blob/master/resources/insights-prod/notifications-prod/service-monitor/notifications-recipients-resolver.servicemonitor.yml?ref_type=heads#L13-16
- is caused by
-
RHCLOUD-44724 Fix the notifications for changes of system roles
-
- Backlog