NO QA NEEDED
Right now the system app logs everything to Bugsnag (an error monitoring tool). When errors occur that are perfectly normal and expected, they trigger a "spike detected on system" alert (the same happens for zync), which in turn triggers a PagerDuty alert that calls the on-call engineer.
The on-call engineer cannot do anything about it: it is a simple, even expected error, so the call is unnecessary and avoidable.
The idea is to log only important traces (a subset of the current ones), so the on-call engineer is only woken up when really needed.
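A minimal sketch of that idea, assuming the filtering happens in application code before anything is sent to Bugsnag. The exception classes are hypothetical stand-ins defined here so the example runs on its own, and `EXPECTED_ERRORS`, `reportable?` and `report_error` are illustrative names, not existing code in the system app. The real list of expected errors would have to be agreed with the dev team.

```ruby
# Hypothetical stand-ins for the application's own exception classes.
class CannotUpdateFriendlyIdException < StandardError; end
class ZyncTimeoutError < StandardError; end

# Assumed list of expected errors that should not reach Bugsnag.
EXPECTED_ERRORS = [CannotUpdateFriendlyIdException, ZyncTimeoutError].freeze

# Returns true only for errors that are not on the expected list.
def reportable?(error)
  EXPECTED_ERRORS.none? { |klass| error.is_a?(klass) }
end

# Passes the error to the given notifier only when it is reportable.
# Returns true if the error was reported, false if it was skipped.
def report_error(error, notifier)
  return false unless reportable?(error)

  notifier.call(error)
  true
end
```

In the application the notifier could be something like `->(e) { Bugsnag.notify(e) }`. The Bugsnag Ruby notifier also has its own configuration options for discarding error classes, which may be a better fit than a wrapper; check the documentation for the gem version in use.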
Pasted chat
Sergio Lopez Today at 10:52 AM
@here everything ok? I've just received a call from PagerDuty regarding that spike
Gui Cassolato
This CannotUpdateFriendlyIdException is usually due to concurrency in RDS, @slopez . Sidekiq will retry.
Sergio Lopez
thanks @gui! Regarding yesterday's conversation, it would be ideal to open a Jira issue to change the way Bugsnag is being used at the moment (from tracking everything there to using it only for real errors that on-call people need to be paged for)
Gui Cassolato 1 hour ago
Please do it, @slopez. I’ll make sure the team understands the importance once it gets selected for estimating.
Meeting summary
(Sergio, hramihaj, mnoyabon, duduribeiro):
- Bugsnag errors are application code errors, and 99% of them can only be fixed by the dev team (if there is an infra issue, the ops on-call team will receive the alert through other channels)
- At the moment all system errors are logged to Bugsnag (even expected ones, such as zync timeouts and Sidekiq retries)
- The high volume of errors causes avoidable calls to on-call (which is why the PagerDuty notification has already been disabled)
- The high volume of errors also causes the Bugsnag service to sample errors (we pay for it and have some limits), so real problems may go unlogged once the service limit is reached
- Temporarily, some errors can be muted at the Bugsnag level
- But the idea is to solve the root cause, so that nothing that is not a real error gets logged to Bugsnag
- Once Bugsnag logging is fixed (so that only a few but important errors are logged), we will decide the next step: leave errors logged in Slack to be fixed by the dev team during work hours, have someone from the dev team receive the Bugsnag error calls in order to fix them, or notify the ops team again