Type: Story
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Labels: None

Blocked: False
Ready: False
Epic Link: Disruption Enhancements
Intelligence Requested:
Market:

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Today we fail the job if its disruption is over the P99 for the last 3 weeks, as determined by a weekly PR to origin. The mechanism for creating that PR, reading its data, and running these tests breaks repeatedly without anyone realizing, and often does things we don't expect.

Disruption ebbs and flows constantly, especially at the 99th percentile, and the test being run is not technically the same from week to week.

We do still want to at least attempt to fail a job run if disruption was significant.

Examples we would not want to fail:

P99 2s, job got 4s. We don't care about a few seconds of disruption at this level. This happens all the time, and it's not a regression. Stop failing the test.

P99 60s, job got 65s. There's already huge disruption possible; a few seconds over is largely irrelevant week to week. Stop failing the test.

Layer 3 disruption monitoring is now our main focus for catching more subtle regressions. This is just a first line of defence: a best attempt at telling a PR author that their change may have caused a drastic disruption problem.

Changes proposed (see details in comments below for the data as to why):

Allow P99 + grace where grace is 5s or 10%.
Remove all fallbacks to historical data except those based on release version (or remove all of them, and shut down testing while data is accumulating for a new release?).
Disable all testing with an empty Platform. Whatever these are, we should not be testing on them.
Fix the broken test names: testNames() in disruptionlegacyapiservers monitortest.go returns the same test name for new and reused connection testing, likely causing flakes when it should be causing failures.
Standardize all disruption test names to something we can easily search for, see this link.
Recheck the data focusing on jobs that ONLY failed on disruption, and see if a more lenient grace would achieve the effects we want there.

links to

openshift/origin#27674: Change disruption threshold to minimize noises

openshift/origin#28453: TRT-786: Relax the per-job disruption tests and fix bugs/improve consistency

Test renames PR

Assignee: Devan Goodwin

Reporter: Devan Goodwin

Votes: 0

Watchers: 2

Created: 2023/01/18 3:27 PM

Updated: 2023/12/18 2:13 PM

Resolved: 2023/12/18 2:13 PM
