  1. Red Hat 3scale API Management
  2. THREESCALE-3155

Log only important stuff in bugsnag for on-call ops people


    • Type: Task
    • Resolution: Obsolete
    • Priority: Major
    • System
    • Not Started
    • 3scale 2019-08-12, 3scale 2019-08-26, 3scale 2019-09-30

      NO QA NEEDED

      Right now the system app logs everything to Bugsnag (an error monitoring tool). When errors that are perfectly normal and expected occur, they trigger a "spike detected on system (and also for zync)" alert, which in turn triggers a PagerDuty alert that calls the on-call engineer.
      The on-call engineer cannot do anything about it: the error is simple, even expected, so the call is unnecessary and avoidable.
      The idea is to log only important traces (a subset of the current ones), so that the on-call engineer is woken up only when really needed.
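
      One possible shape for "log only important traces" is a small predicate that decides whether an exception is worth reporting at all. The sketch below is an assumption, not the system app's actual code: the constant EXPECTED_ERROR_CLASSES, the method report_to_error_tracker?, and the class ZyncTimeoutError are hypothetical names, and the wiring into the Bugsnag client is deliberately left out.

```ruby
# Hypothetical stand-ins for errors the ticket describes as expected.
# CannotUpdateFriendlyIdException is named in the chat; ZyncTimeoutError is an invented name.
class CannotUpdateFriendlyIdException < StandardError; end
class ZyncTimeoutError < StandardError; end

# Names of exception classes that are expected and should not reach the error tracker.
EXPECTED_ERROR_CLASSES = %w[
  CannotUpdateFriendlyIdException
  ZyncTimeoutError
].freeze

# Returns true when the exception is NOT an expected error, i.e. it should be reported.
# Ancestors are checked so that subclasses of an expected error are also skipped.
def report_to_error_tracker?(exception)
  exception.class.ancestors.none? do |klass|
    EXPECTED_ERROR_CLASSES.include?(klass.name)
  end
end
```

      A callback or middleware in the error-reporting path could call this predicate and drop the event when it returns false. Keeping the list in one place also makes it easy to review which errors are being silenced.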

      Pasted chat
      Sergio Lopez Today at 10:52 AM
      @here everything ok? I've just received a call from PagerDuty regarding that spike

      Gui Cassolato
      This CannotUpdateFriendlyIdException is usually due to concurrency in RDS, @slopez . Sidekiq will retry.

      Sergio Lopez
      thanks @gui! Regarding yesterday's conversation, it would be ideal to open a Jira issue to change the way Bugsnag is used at the moment (from tracking everything there to using it only for real errors that on-call people need to be paged for)

      Gui Cassolato 1 hour ago
      Please do it, @slopez. I’ll make sure the team understands the importance once it gets selected for estimating.

      Meeting summary

      (Sergio, hramihaj, mnoyabon, duduribeiro):

      • Bugsnag errors are application code errors, and 99% of them can only be fixed by the dev team (if there is an infra issue, the ops on-call team will receive the alert through other channels)
      • At the moment all system errors are logged to Bugsnag (even expected errors such as zync timeouts and Sidekiq retries)
      • The high volume of errors causes avoidable calls to on-call (which is why PagerDuty notifications have already been disabled)
      • The high volume of errors also causes the Bugsnag service to sample errors (we pay for it and have some limits), so real problems may go unlogged once the service limit is reached
      • Temporarily, some errors can be muted at the Bugsnag level
      • But the idea is to solve the root cause: do not log anything to Bugsnag that is not a real error
      • Once Bugsnag logging is fixed (so that only a few but important errors are logged), we will decide the next step: leave errors logged in Slack to be fixed by the dev team during work hours, have someone from the dev team receive those Bugsnag error calls in order to fix them, or notify the ops team again
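
      For the Sidekiq retries mentioned above, one option is to treat a failure as a real error only once the job has used up its retries. This is a minimal sketch under stated assumptions: report_job_failure?, retry_count, and max_retries are hypothetical names, and how those values are obtained from the job is not shown.

```ruby
# Hypothetical helper: report a failed background job only when its retries are exhausted.
# retry_count: retries already attempted (nil is treated as 0, meaning the first failure).
# max_retries: the retry limit configured for the job.
def report_job_failure?(retry_count, max_retries)
  retries_so_far = retry_count.to_i
  retries_so_far >= max_retries
end
```

      With this rule, a transient failure that succeeds on a later retry never reaches the error tracker, while a job that keeps failing is reported once, at the point where retrying stops.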

              Assignee: Unassigned
              Reporter: Catherine Bartlett (cbartlet)