Task
Resolution: Unresolved
Normal
None
1. Fix link in alert
In the alert, the link to the Grafana dashboard should be https://grafana.stage.devshift.net/d/TjP_nMWMk/operations-rbac-w-cpu?orgId=1&from=now-24h&to=now&timezone=UTC&var-datasource=PDD8BE47D10408F45&var-Endpoint=$__all&var-Namespace=rbac-stage
for stage, and the corresponding prod link for prod.
Currently the stage alert shows the prod link. Alert: https://redhat-internal.slack.com/archives/C05LRFL650V/p1764066567648579
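A quick way to catch this kind of mismatch is to check the host and the var-Namespace query parameter of the link an alert carries. The sketch below is only an illustration: the function name link_matches_env is hypothetical, the stage host and namespace are taken from the link above, and "rbac-prod" is an assumed name used purely as a counter-example.

```python
from urllib.parse import urlparse, parse_qs


def link_matches_env(url, expected_host, expected_namespace):
    """Return True if the dashboard link points at the expected Grafana host
    and selects the expected namespace via the var-Namespace parameter."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    namespaces = params.get("var-Namespace", [])
    return parsed.hostname == expected_host and namespaces == [expected_namespace]


# Stage link exactly as given in this ticket.
STAGE_LINK = (
    "https://grafana.stage.devshift.net/d/TjP_nMWMk/operations-rbac-w-cpu"
    "?orgId=1&from=now-24h&to=now&timezone=UTC"
    "&var-datasource=PDD8BE47D10408F45&var-Endpoint=$__all"
    "&var-Namespace=rbac-stage"
)

# The stage alert should carry a link that passes this check.
stage_ok = link_matches_env(STAGE_LINK, "grafana.stage.devshift.net", "rbac-stage")

# "rbac-prod" is an assumed namespace name, used here only to show a mismatch.
wrong_env = link_matches_env(STAGE_LINK, "grafana.stage.devshift.net", "rbac-prod")
```

Here stage_ok is True and wrong_env is False; the same check with the prod host and namespace would apply to the prod alert.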
2. Reduce panels in Grafana
Keep only:
- RBAC Consumer -> Relations Excessive lag events (>0.02 rate diff for 10m)
- RBAC Consumer -> Relations: Replication event creation rate minus total consumer successful "relations" processed
Convert:
- RBAC Replication Event Count
- Consumer Relations Message
to display just numbers, like "Replication events increase" and "Sink event increase" in https://grafana.stage.devshift.net/d/ce3ty1vy1gpvkd/kessel-relations-api-data-sync?orgId=1&var-Datasource=PDD8BE47D10408F45&from=now-30m&to=now&timezone=browser
- use also sum(increase ...
3. Add panels for replication event latency metrics
The metrics were added in https://github.com/RedHatInsights/insights-rbac/pull/2231
- panels for
- Average replication latency over 5 minutes (one number in panel): rate(rbac_replication_event_latency_seconds_sum[5m]) / rate(rbac_replication_event_latency_seconds_count[5m])
- 95th percentile latency (graph): histogram_quantile(0.95, rate(rbac_replication_event_latency_seconds_bucket[5m]))
- 99th percentile latency by event type (graph): histogram_quantile(0.99, sum(rate(rbac_replication_event_latency_seconds_bucket[5m])) by (le, event_type))
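For reviewing the panel values, it may help to see what these queries compute. The following is a rough Python sketch, not the Prometheus implementation: it assumes cumulative buckets keyed by their le upper bound, ending in +Inf, and uses linear interpolation inside the matching bucket, which is how histogram_quantile is documented to estimate quantiles. The function names and the sample bucket values are made up for illustration.

```python
import math


def average_latency(sum_rate, count_rate):
    """Mirror of rate(..._sum[5m]) / rate(..._count[5m]): mean latency in seconds."""
    if count_rate == 0:
        return math.nan  # no events observed in the window
    return sum_rate / count_rate


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (le, cumulative_rate) pairs, with at least one finite
    bucket and a final le of math.inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation within the bucket that contains the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return math.nan


# Illustrative (made-up) per-second bucket rates for one event type.
sample = [(0.1, 10.0), (0.5, 30.0), (1.0, 38.0), (math.inf, 40.0)]
p50 = histogram_quantile(0.5, sample)
p99 = histogram_quantile(0.99, sample)
avg = average_latency(12.0, 40.0)
```

One consequence visible in the sketch: when the requested quantile lands in the +Inf bucket, the result is capped at the largest finite bucket bound, so a flat p99 line at that bound on the graph suggests the bucket layout is too coarse rather than that latency is stable.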