  1. OCP Technical Release Team
  2. TRT-2288

Aggressive intra-job test retries on presubmits


    • Type: Story
    • Resolution: Done
    • Priority: Major

      I've been thinking about how we could enable intra-job test retries without tooling changes, i.e. how we could "rescue" presubmits from flakes now and avoid the retest struggle, without confusing our tools. I came up with https://github.com/openshift/origin/pull/30223

      It adds retry strategies to origin:

      • none (no retries)
      • once (current behavior)
      • aggressive

      In aggressive:

      • If the test passes, we are done.
      • If it fails, and the first run was short enough (currently 2 min), we retry up to 10 times.
        • If there are 4 or more failures, we produce a single failure artifact with all the outputs, and it's considered a true failure.
        • If there are fewer than 4, we leave the 10 results and it gets considered a flake by spyglass, sippy, etc.
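The aggressive rules above can be sketched roughly as follows. This is an illustrative Python sketch, not the Go code from the PR: every name and constant here (`Attempt`, `run_aggressive`, `MAX_RETRIES`, and so on) is hypothetical and only mirrors the numbers quoted in this description. It also makes three assumptions the description does not spell out: the initial failure counts toward the threshold of 4, retries stop early once that threshold is reached, and a slow first failure is simply reported as-is.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical constants mirroring the numbers in the description.
MAX_FIRST_RUN_SECONDS = 120  # "short enough (currently 2 min)"
MAX_RETRIES = 10             # "retry up to 10 times"
FAILURE_THRESHOLD = 4        # "4 or more failures" is a true failure


@dataclass
class Attempt:
    passed: bool
    duration_seconds: float
    output: str = ""


def run_aggressive(run_test: Callable[[], Attempt]) -> List[Attempt]:
    """Return the results to report for one test under the aggressive strategy."""
    first = run_test()
    # A pass needs no retries; a slow failure is not retried (assumption).
    if first.passed or first.duration_seconds > MAX_FIRST_RUN_SECONDS:
        return [first]

    attempts = [first]
    failures = 1  # assumption: the first failure counts toward the threshold
    for _ in range(MAX_RETRIES):
        attempt = run_test()
        attempts.append(attempt)
        if not attempt.passed:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                # True failure: collapse everything into a single result
                # that carries the output of every attempt so far.
                merged = "\n---\n".join(a.output for a in attempts)
                total = sum(a.duration_seconds for a in attempts)
                return [Attempt(False, total, merged)]

    # Fewer than FAILURE_THRESHOLD failures: keep the discrete results, so a
    # mix of passes and failures for the same test reads as a flake.
    return attempts
```

Returning either one merged failure or the full list of discrete results is what lets existing tools keep working: a lone failure looks like any other failure, while mixed results for the same test name are already interpreted as a flake.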

       
      PR 30223 is still running through jobs on its latest version, but earlier versions were very successful: they rescued many presubmit jobs from failure, like this one
       
      For now, we could leave periodics as "once" so as not to change the kind of data we're giving to tools like CR. Eventually we could make CR, sippy, etc. aware of this and analyze tests as discrete results, but as Justin's doc notes, that is a lot of work.

      However, we could enable this on presubmits pretty safely. The potential for reducing retests is pretty high here, with only a marginal chance that we introduce a regression.

              stbenjam Stephen Benjamin