Red Hat OpenShift Data Science / RHODS-7489

PagerDuty alerts are missing cluster identification details


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • RHODS_1.23.0_GA
    • Monitoring
    • Testable
    • Pending

      Description of problem:

      PagerDuty alerts received by the SRE team do not include the cluster_id, or any other details about the cluster for which the alert was triggered.
      The following Prometheus configuration needs to be updated: https://github.com/red-hat-data-services/odh-deployer/blob/main/monitoring/prometheus/prometheus-configs.yaml

      The following is a current alert received by SRE:

      Labels:
       - alertname = RHODS Dashboard Probe Success Burn Rate
       - name = rhods-dashboard
       - severity = critical
      Annotations:
       - message = High error budget burn for  (current value: 0.09999999999999998).
       - summary = RHODS Dashboard Probe Success Burn Rate
       - triage = https://gitlab.cee.redhat.com/service/managed-tenants-sops/-/blob/main/RHODS/Jupyter/rhods-dashboard-probe-success-burn-rate.md
      Source: http://prometheus-6f855b778d-jk4sk:9090/graph?..............................................................................

      Note that the SOP link in the above alert is broken. The SOP links in all alerts have to be checked to make sure they are not broken.
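
      A minimal sketch of how the SOP links could be checked in bulk, assuming the alert rules are available as text and carry the SOP URL in a "triage" annotation as in the alert above. The function names are hypothetical and not part of odh-deployer. Links behind authentication (such as the internal GitLab) may redirect to a login page and still return 200, so a clean result here is a hint rather than proof.

```python
import re
import urllib.error
import urllib.request

# Assumption: SOP links appear as "triage: <url>" or "triage = <url>".
TRIAGE_RE = re.compile(r"""triage\s*[:=]\s*["']?(https?://[^\s"']+)""")


def extract_triage_links(rules_text):
    """Return the unique triage URLs in the order they first appear."""
    seen = []
    for url in TRIAGE_RE.findall(rules_text):
        if url not in seen:
            seen.append(url)
    return seen


def http_status(url, timeout=10):
    """Return the HTTP status for url, or None if the request failed outright."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def find_broken_links(urls, fetch_status=http_status):
    """Return (url, status) pairs for links that did not answer with 2xx."""
    broken = []
    for url in urls:
        status = fetch_status(url)
        if status is None or not 200 <= status < 300:
            broken.append((url, status))
    return broken
```

      The fetch_status parameter is injectable so the check can be exercised without network access; in real use the default performs a HEAD request per link.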

      Acceptance Criteria: 

      • Include details about the cluster in the alerts (especially cluster_id) so that SRE members can uniquely identify the cluster.
      • The SOP links in all alerts are not broken.
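
      The first criterion could be verified with a small check like the sketch below. It assumes the alert labels are available in the " - key = value" form shown in this ticket, and that cluster_id is the only required label. The helper names and the example cluster_id value are hypothetical.

```python
# Assumption: cluster_id is the label SRE needs to identify the cluster.
REQUIRED_LABELS = ("cluster_id",)


def parse_pairs(lines):
    """Parse ' - key = value' lines into a dict of label name to value."""
    pairs = {}
    for line in lines:
        text = line.strip().lstrip("-").strip()
        if " = " in text:
            key, value = text.split(" = ", 1)
            pairs[key.strip()] = value.strip()
    return pairs


def missing_labels(labels, required=REQUIRED_LABELS):
    """Return the required label names that are absent or empty."""
    return [name for name in required if not labels.get(name)]


# Labels copied from the alert in this ticket.
current_alert = parse_pairs([
    " - alertname = RHODS Dashboard Probe Success Burn Rate",
    " - name = rhods-dashboard",
    " - severity = critical",
])
print(missing_labels(current_alert))

# The same alert with a hypothetical cluster_id label attached.
fixed_alert = dict(current_alert, cluster_id="example-cluster-id")
print(missing_labels(fixed_alert))
```

      For the alert quoted in this ticket the check reports cluster_id as missing; once the label is attached the result is an empty list.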

            rh-ee-magautie Max Gautier (Inactive)
            rhn-support-cabeywar Chamal Abeywardhana
            Jorge Garcia Oncins