OpenShift Bugs / OCPBUGS-23537

[Backport 4.12.z] Agent should retry with exponential backoff even on seemingly irrecoverable errors

    • Sprint 244, Sprint 246

      This is a clone of MGMT-11551, created to backport the PRs https://github.com/openshift/assisted-installer-agent/pull/438 and https://github.com/openshift/assisted-installer-agent/pull/442 into agent-based-installer 4.12.z.

      Description of the problem:

      Currently, in 4.12.z, when the agent encounters a seemingly irrecoverable error, it sleeps forever.

      This is not ideal because we cannot be confident that those errors are truly irrecoverable, and retrying might save the day. To avoid generating too much noise from such agents, the retry delay algorithm should use exponential backoff.

      How reproducible:

      A single occurrence in the last few months, while running ~100 installation jobs per weekend in a CI pipeline.

      Steps to reproduce:

      1. N/A, potential race condition

      Actual results:

      • Failed to bootstrap

      Expected results:

      • Successful installation.

            bfournie@redhat.com Robert Fournier
            rhn-support-arolivei Arthur Oliveira
            zhenying niu zhenying niu