Project Quay / PROJQUAY-1113

Improve SOP and alert reset for QuayBuilderTooManyEc2Builds

Details

    • Bug
    • Resolution: Obsolete
    • Major
    • quay.io

    Description

      Hello team, 

      We've recently been seeing this alert in the app-sre alerts channel: 

      https://coreos.slack.com/archives/CDW0S85QU/p1601294118096600

      As you can see, the alert is quite spammy and resets very easily.

      I would like to request two actions here from the Quay team: 

      • Please review and tune the alert according to your desired SLIs for the builders.
      • Please provide the SRE some steps on how to debug and attempt to fix such a scenario. For example, what dashboard should one look at? Should we go check EC2 instances? How can we fix this problem?

      Please feel free to let me know if more info is needed.

      Until then, I have downgraded this alert to `medium` severity, which means it will show up in #sd-app-sre-quay-info instead of our alerts channel.

      It is in the best interest of all our tenants that we keep our alerting channel very high signal and low noise. 

      Happy monitoring!  

          People

            sleesinc Kenny Lee Sin Cheong
            akonarde@redhat.com Aditya Konarde