Type: Story
Resolution: Done
Priority: Normal
Today we fail the job if its disruption is over the P99 for the last 3 weeks, as determined by a weekly PR to origin. The mechanism for creating that PR, reading its data, and running these tests has broken repeatedly without anyone realizing, and often does things we don't expect.
Disruption ebbs and flows constantly, especially at the 99th percentile, so the test being run is not technically the same from week to week.
We still want to at least attempt to fail a job run when disruption is significant.
Examples we would not want to fail:
- P99 2s, job got 4s. We don't care about a few seconds of disruption at this level; it happens all the time and is not a regression. Stop failing the test.
- P99 60s, job got 65s. There is already huge disruption possible here, so a few seconds over is largely irrelevant week to week. Stop failing the test.
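To make the current behaviour concrete, here is a minimal sketch (hypothetical names, not the actual origin code) of the rule the examples above are failing against: any observed disruption over the historical P99 fails the job, no matter how small the margin.

```go
package disruption

import "time"

// currentRuleFails sketches today's hard threshold: the job fails as soon as
// observed disruption exceeds the historical P99, even by a second or two.
// Both examples above fail under this rule (4s > 2s, 65s > 60s).
func currentRuleFails(observed, historicalP99 time.Duration) bool {
	return observed > historicalP99
}
```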
Layer 3 disruption monitoring is now our main focus for catching more subtle regressions; this test is just a first line of defence, a best attempt at telling a PR author that their change may have caused a drastic disruption problem.
Changes proposed (see the comments below for the data behind these):
- Allow P99 + grace, where grace is 5s or 10% (a rough sketch of this check follows the list).
- Remove all fallbacks to historical data except the one based on release version (or remove all of them and shut down testing while data is accumulating for a new release?).
- Disable all testing with an empty Platform; whatever these jobs are, we should not be testing against them.
- Fix the broken test names: disruptionlegacyapiservers monitortest.go testNames() is returning the same test name for new/reused connection testing, likely causing flakes where it should be causing failures.
- Standardize all disruption test names to something we can easily search for, see this link (a hypothetical naming example follows the list).
- Recheck the data focusing on jobs that ONLY failed on disruption, and see whether a more lenient grace would achieve the effects we want there.
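Below is a rough sketch of the first bullet (the grace threshold) and of the naming-related bullets. Two assumptions are called out in the comments: that "5s or 10%" means whichever of the two is larger, and that standardized names would encode the backend plus new/reused connection type. All identifiers here are hypothetical and are not the actual origin implementation.

```go
package disruption

import (
	"fmt"
	"time"
)

// allowedDisruption is the proposed threshold: historical P99 plus a grace
// period. ASSUMPTION: "5s or 10%" is read as whichever of the two is larger;
// the ticket leaves this ambiguous.
func allowedDisruption(historicalP99 time.Duration) time.Duration {
	grace := 5 * time.Second
	if tenPercent := historicalP99 / 10; tenPercent > grace {
		grace = tenPercent
	}
	return historicalP99 + grace
}

// exceedsThreshold reports whether a run should fail under the proposed rule.
// Neither example above fails: P99 2s / observed 4s (allowed 7s), and
// P99 60s / observed 65s (allowed 66s).
func exceedsThreshold(observed, historicalP99 time.Duration) bool {
	return observed > allowedDisruption(historicalP99)
}

// disruptionTestName is a hypothetical illustration of the naming bullets:
// every disruption test name carries the backend and whether the connection
// is new or reused, so new/reused results can never collapse into one name
// and all names share a common, searchable pattern.
func disruptionTestName(backend string, reusedConnection bool) string {
	connType := "new"
	if reusedConnection {
		connType = "reused"
	}
	return fmt.Sprintf("disruption/%s connection/%s should not exceed the allowed threshold", backend, connType)
}
```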