OpenShift Bugs / OCPBUGS-23537

[Backport 4.12.z] Agent should retry with exponential backoff even on seemingly irrecoverable errors

Details

    • No
    • Sprint 244, Sprint 246
    • 2
    • False

    Description

      This is a clone of MGMT-11551, created to backport the PRs https://github.com/openshift/assisted-installer-agent/pull/438 and https://github.com/openshift/assisted-installer-agent/pull/442 to agent-based-installer 4.12.z.

      Description of the problem:

      Currently, in 4.12.z, when the agent encounters a seemingly irrecoverable error, it sleeps forever.

      This is not ideal because we cannot be confident that those errors are truly irrecoverable, and retrying might succeed. To avoid generating too much noise from such agents, the retry delay should use exponential backoff.

      How reproducible:

      A single occurrence in the last few months, while running ~100 installation jobs per weekend in CI pipelines.

      Steps to reproduce:

      1. N/A, potential race condition

      Actual results:

      • Failed to bootstrap

      Expected results:

      • Successful installation.

            People

              bfournie@redhat.com Robert Fournier
              rhn-support-arolivei Arthur de Oliveira
              zhenying niu zhenying niu