-
Story
-
Resolution: Done
-
Major
-
Product / Portfolio Work
I've been thinking about how we could enable intra-job test retries without tooling changes, i.e. how we could "rescue" from flakes on presubmits now and avoid the retest struggle, without confusing our tools. I came up with https://github.com/openshift/origin/pull/30223
Adds retry strategies to origin:
- none (no retries)
- once (current behavior)
- aggressive
In aggressive:
- If the test passes, we are done.
- If it fails, and the first run was short enough (currently 2 min), we retry up to 10 times.
- If there are 4 or more failures, we produce a single failure artifact with all the outputs, and it's considered a true failure.
- If there are fewer than 4, we leave the 10 results and it gets considered a flake by spyglass, sippy, etc.
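The aggressive rules above can be sketched roughly as follows. This is an illustration, not the code in the PR: every name here (`run_aggressive`, `Attempt`, `Outcome`, the constants) is hypothetical. It also assumes two things the description leaves open: that the initial failure counts toward the 4-failure threshold, and that retries stop as soon as the threshold is reached.

```python
from dataclasses import dataclass
from typing import Callable, List

# Values taken from the description above; the names are hypothetical.
MAX_RETRIES = 10           # "retry up to 10 times"
FAILURE_THRESHOLD = 4      # "4 or more failures" is a true failure
SHORT_RUN_SECONDS = 120.0  # "short enough (currently 2 min)"


@dataclass
class Attempt:
    passed: bool
    duration_seconds: float
    output: str = ""


@dataclass
class Outcome:
    verdict: str            # "pass", "flake", or "fail"
    results: List[Attempt]  # what would be reported to downstream tools


def run_aggressive(run_test: Callable[[], Attempt]) -> Outcome:
    first = run_test()
    if first.passed:
        return Outcome("pass", [first])
    if first.duration_seconds > SHORT_RUN_SECONDS:
        # The first run was too long to be worth retrying.
        return Outcome("fail", [first])

    attempts = [first]
    failures = 1  # assumption: the initial failure counts toward the threshold
    for _ in range(MAX_RETRIES):
        if failures >= FAILURE_THRESHOLD:
            break  # assumption: stop early once the verdict is decided
        attempt = run_test()
        attempts.append(attempt)
        if not attempt.passed:
            failures += 1

    if failures >= FAILURE_THRESHOLD:
        # Collapse everything into one failure record carrying all outputs.
        merged = Attempt(
            passed=False,
            duration_seconds=sum(a.duration_seconds for a in attempts),
            output="\n".join(a.output for a in attempts),
        )
        return Outcome("fail", [merged])

    # Mixed pass/fail results are left as-is so tools read them as a flake.
    return Outcome("flake", attempts)
```

Keeping the individual results in the flake case is what allows existing tools to classify the test without any changes, since they already treat a mix of passing and failing results for the same test as a flake.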
The latest version of 30223 is still running through jobs, but earlier versions were very successful. They rescued many presubmit jobs from failure, like this one
For now, we could leave periodics as "once" so as not to change the kind of data we're giving to tools like CR. Eventually we could make CR, sippy, etc. aware of this and analyze tests as discrete results, but as Justin's doc notes, that is a lot of work.
However, we could enable this on presubmits pretty safely. The potential for reducing retests is pretty high here, with only a marginal chance that we introduce a regression.