-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
RHODS_1.23.0_GA
-
False
-
None
-
False
-
Testable
-
No
-
No
-
No
-
Pending
-
None
Description of problem:
PagerDuty alerts received by the SRE team do not include the cluster_id, or any other details about the cluster for which the alert was triggered.
The following Prometheus configuration needs to be updated: https://github.com/red-hat-data-services/odh-deployer/blob/main/monitoring/prometheus/prometheus-configs.yaml
The following is a current alert received by SRE:
Labels:
- alertname = RHODS Dashboard Probe Success Burn Rate
- name = rhods-dashboard
- severity = critical
Annotations:
- message = High error budget burn for (current value: 0.09999999999999998).
- summary = RHODS Dashboard Probe Success Burn Rate
- triage = https://gitlab.cee.redhat.com/service/managed-tenants-sops/-/blob/main/RHODS/Jupyter/rhods-dashboard-probe-success-burn-rate.md
Source: http://prometheus-6f855b778d-jk4sk:9090/graph?..............................................................................
Note that the SOP link in the above alert is broken. The SOP links in all alerts have to be checked to make sure none of them are broken.
Acceptance Criteria:
- Include details about the cluster in the alerts (especially cluster_id) so that SRE members can uniquely identify the cluster.
- For all alerts, the SOP links are not broken.
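The first criterion could be verified with a small check over received alerts. This is a minimal sketch, assuming each alert is available as a dict with a `labels` mapping and that the label is literally named `cluster_id` as in this ticket; the function name `alerts_missing_labels` and the value `example-cluster-id` are placeholders, not real identifiers.

```python
from typing import Iterable, List, Sequence

# Assumption: the label name matches the one requested in this ticket.
REQUIRED_LABELS = ("cluster_id",)


def alerts_missing_labels(
    alerts: Iterable[dict], required: Sequence[str] = REQUIRED_LABELS
) -> List[str]:
    """Return the alertname of every alert lacking a non-empty required label."""
    missing = []
    for alert in alerts:
        labels = alert.get("labels", {})
        if any(not labels.get(key) for key in required):
            missing.append(labels.get("alertname", "<unnamed>"))
    return missing


# Labels copied from the alert quoted in the description.
current_alert = {
    "labels": {
        "alertname": "RHODS Dashboard Probe Success Burn Rate",
        "name": "rhods-dashboard",
        "severity": "critical",
    }
}
# Same alert with a placeholder cluster_id added.
fixed_alert = {"labels": {**current_alert["labels"], "cluster_id": "example-cluster-id"}}

print(alerts_missing_labels([current_alert, fixed_alert]))
```

Only the alert without `cluster_id` is reported, so an empty result would indicate that every checked alert carries the label.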