Type: Task
Resolution: Obsolete
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: System
Labels:
- icebox

3Scale PT Tested upstream:
Not Started
3scale PT Docs:
Not Started
3scale PT Product Specs:
Not Started
3scale PT Product Update Ready:
Not Started
3scale PT Released In Saas:
Not Started
3scale PT Verified Product:
Not Started
Target Release:

SaaS

Sprint:
3scale 2019-08-12, 3scale 2019-08-26, 3scale 2019-09-30

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

NO QA NEEDED

Right now the system app logs everything to Bugsnag (an error monitoring tool). When errors occur that are perfectly normal and expected, they trigger a "spike detected on system" alert (the same happens for zync), which in turn triggers a PagerDuty alert that calls the on-call engineer.
The on-call engineer cannot do anything about it: the error is simple, even expected, so the call is unnecessary and avoidable.
The idea is to log only important traces (a subset of the current ones), so that the on-call engineer is woken up only when really needed.

Pasted chat
Sergio Lopez Today at 10:52 AM
@here everything ok? I've just received a call from PagerDuty regarding that spike

Gui Cassolato
This CannotUpdateFriendlyIdException is usually due to concurrency in RDS, @slopez . Sidekiq will retry.

Sergio Lopez
thanks @gui! Regarding yesterday's conversation, it would be ideal to open a Jira issue to change the way Bugsnag is being used at the moment (from tracking everything there to using it only for real errors that on-call people need to be paged for)

Gui Cassolato 1 hour ago
Please do it, @slopez. I’ll make sure the team understands the importance once it gets selected for estimating.

Meeting summary

(Sergio, hramihaj, mnoyabon, duduribeiro):

Bugsnag errors are application code errors, and 99% of them can only be fixed by the dev team (if there is an infra issue, the ops on-call team will receive the alert through other channels).
At the moment all system errors are logged to Bugsnag, even expected ones such as zync timeouts and Sidekiq retries.
The volume of errors is causing avoidable calls to on-call (which is why the PagerDuty notification has already been disabled).
The volume of errors is also causing the Bugsnag service to sample errors (we pay for it and we have some limits), so real problems may go unlogged once the service limit is reached.
Temporarily, some errors can be muted at the Bugsnag level.
But the idea is to solve the root cause: do not log anything to Bugsnag that is not a real error.
Once Bugsnag logging is fixed so that only real errors are logged (few but important), we will decide the next step: leave errors logged in Slack to be fixed by the dev team during work hours, have someone from the dev team receive those Bugsnag error calls in order to fix them, or notify the ops team again.
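
The filtering idea in the summary above could be sketched roughly as follows. This is a language-neutral illustration in Python, not the system app's code and not Bugsnag's API. The names `EXPECTED_ERROR_CLASSES`, `ErrorEvent` and `should_report` are hypothetical, `ZyncTimeoutError` is an invented placeholder for "zync timeouts", and the rule that an expected error is still reported once its retries are exhausted is an assumption, not something decided in the meeting.

```python
from dataclasses import dataclass

# Hypothetical list of error class names treated as expected noise.
# CannotUpdateFriendlyIdException is taken from the chat above;
# ZyncTimeoutError is an invented placeholder for "zync timeouts".
EXPECTED_ERROR_CLASSES = frozenset({
    "CannotUpdateFriendlyIdException",
    "ZyncTimeoutError",
})


@dataclass(frozen=True)
class ErrorEvent:
    error_class: str
    retries_left: int = 0  # retries still pending in the job queue


def should_report(event: ErrorEvent) -> bool:
    """Return True only for errors worth sending to the error tracker."""
    if event.error_class not in EXPECTED_ERROR_CLASSES:
        return True  # anything not known to be expected is always reported
    # Assumption: an expected error becomes a real one when no retry remains.
    return event.retries_left == 0
```

With this rule, a `CannotUpdateFriendlyIdException` that the job queue will still retry is dropped, while the same error with no retries left, or any error outside the list, is reported.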

Assignee: Unassigned

Reporter: Catherine Bartlett

Votes: 0

Watchers: 5

Created: 2019/08/01 6:21 AM

Updated: 2024/01/30 4:50 PM

Resolved: 2024/01/30 4:50 PM
