OpenShift Over the Air / OTA-769

create alert for conditions that caused 2022-08-25 OSUS incident


    • Type: Task
    • Resolution: Done
    • Priority: Major
    • 5
    • OTA 224

      When degraded OSUS performance occurred on Aug 25 2022, no alerts fired. See this ticket for details: APPSRE-6192

      Please create an alert that would have notified teams of this condition more quickly.

      Slack channel: #incident_osus_high_latency_timeout (link: https://coreos.slack.com/archives/C03UQ5U2CP9)

      RCA document: link

      Definition of done:

      • Create an alert when policy engine latency is more than 1 second.
      • Create an alert when Envoy has more than 100 pending requests.
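
      The two conditions above could be sketched as Prometheus-style alerting rules. The following is a minimal sketch, not the rules that were actually deployed: the metric names, the p99 quantile, the 5-minute window, the alert names, and the severity label are all assumptions introduced for illustration. Only the two thresholds (1 second, 100 pending requests) come from this ticket. The rules are emitted as JSON, which a YAML parser also accepts.

```python
import json

# Thresholds taken from the definition of done in this ticket.
LATENCY_THRESHOLD_SECONDS = 1
PENDING_REQUESTS_THRESHOLD = 100

# ASSUMPTIONS (not from the ticket): metric names, quantile, and hold time.
# Replace these with whatever the policy engine and Envoy actually export.
LATENCY_METRIC = "policy_engine_request_duration_seconds"  # hypothetical histogram
PENDING_METRIC = "envoy_cluster_upstream_rq_pending_active"  # assumed Envoy gauge name
HOLD_FOR = "5m"  # assumed: condition must persist this long before firing


def build_rules():
    """Return a Prometheus-style rule group as a plain dict."""
    # p99 latency over a 5-minute window, compared against the 1 s threshold.
    latency_expr = (
        "histogram_quantile(0.99, sum by (le) "
        f"(rate({LATENCY_METRIC}_bucket[5m]))) > {LATENCY_THRESHOLD_SECONDS}"
    )
    # Total pending upstream requests across clusters, compared against 100.
    pending_expr = f"sum({PENDING_METRIC}) > {PENDING_REQUESTS_THRESHOLD}"

    def rule(name, expr, summary):
        return {
            "alert": name,
            "expr": expr,
            "for": HOLD_FOR,
            "labels": {"severity": "high"},  # assumed severity label
            "annotations": {"summary": summary},
        }

    return {
        "groups": [
            {
                "name": "osus-degraded-performance",  # hypothetical group name
                "rules": [
                    rule(
                        "OSUSPolicyEngineHighLatency",
                        latency_expr,
                        "Policy engine latency is above 1 second",
                    ),
                    rule(
                        "OSUSEnvoyPendingRequestsHigh",
                        pending_expr,
                        "Envoy has more than 100 pending requests",
                    ),
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_rules(), indent=2))
```

      Keeping the thresholds as named constants makes it easy to tune them once real metric data from the incident is reviewed.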

       

            pratikam Pratik Mahajan
            jbeakley+sd-app-sre Jonathan Beakley (Inactive)