OCPBUGS-77043

Restore wave execution order breaks when using 10+ waves

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • 4.20.z
    • LCA operator

      Description of problem:

      When using 10 or more restore waves during an IBU upgrade, the execution order does not match the lca.openshift.io/apply-wave annotation values.
      
      The LCA correctly sorts the Restore CRs and exports them into numbered subdirectories (restore1, restore2, ..., restore11) during the pre-pivot phase. After the reboot, however, it reads them back with os.ReadDir, which returns entries sorted lexicographically by filename rather than numerically. As a result, restore10 and restore11 are read before restore2, which breaks the intended order.
      
      Relevant code:
      * ExportRestoresToDir: https://github.com/openshift-kni/lifecycle-agent/blob/release-4.20/internal/backuprestore/restore.go#L185-L192
      * LoadGroupedManifestsFromPath: https://github.com/openshift-kni/lifecycle-agent/blob/release-4.20/utils/utils.go#L457
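
The readback order can be reproduced outside the LCA. The following is a minimal sketch, not the lifecycle-agent code: it creates directories named restore1 through restoreN in a temporary directory and lists them with os.ReadDir, which returns entries sorted by filename. The helper name readBackOrder and the temporary directory prefix are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// readBackOrder creates n directories named restore1..restoreN under a
// temporary directory and returns their names in the order os.ReadDir
// reports them. Hypothetical helper, for illustration only.
func readBackOrder(n int) ([]string, error) {
	root, err := os.MkdirTemp("", "restore-order-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)

	for i := 1; i <= n; i++ {
		dir := filepath.Join(root, fmt.Sprintf("restore%d", i))
		if err := os.Mkdir(dir, 0o755); err != nil {
			return nil, err
		}
	}

	// os.ReadDir sorts entries by filename, so the numeric suffix is
	// compared character by character, not as an integer.
	entries, err := os.ReadDir(root)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	return names, nil
}
```

Calling readBackOrder(11) should return restore10 and restore11 immediately after restore1 and before restore2, which matches the order reported under "Actual results".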

      Version-Release number of selected component (if applicable):

      lifecycle-agent v4.20.1 (verified on release-4.20 branch, but the issue exists in all versions as the code has not changed)

      How reproducible:

      Always, when using 10 or more distinct lca.openshift.io/apply-wave values in Restore CRs.

      Steps to Reproduce:

          1. Define 11 Restore CRs with lca.openshift.io/apply-wave values from 1 to 11
          2. Trigger an IBU upgrade
          3. After the reboot, observe the order in which restores are created in the LCA logs

      Actual results:

      The restores are executed in the following order, which is the lexicographic order of the directory names rather than the numeric order of the waves:
      restore1 → apply-wave 1
      restore10 → apply-wave 10
      restore11 → apply-wave 11
      restore2 → apply-wave 2
      restore3 → apply-wave 3
      ...
      restore9 → apply-wave 9
      
      Waves 10 and 11 are executed before wave 2.

      Expected results:

      Restores should be executed in the order defined by the lca.openshift.io/apply-wave annotation: 1, 2, 3, ..., 10, 11.
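
One way to obtain the expected order on the read side is to sort the directory names by their numeric suffix instead of relying on the filename order. The sketch below assumes the directories follow the restore<N> naming shown in this report. The function names waveIndex and sortByWave are hypothetical and are not part of the lifecycle-agent code base.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// waveIndex extracts the numeric suffix from a name such as "restore10".
// The "restore" prefix is taken from the directory names in this report.
func waveIndex(name string) (int, error) {
	const prefix = "restore"
	if !strings.HasPrefix(name, prefix) {
		return 0, fmt.Errorf("directory %q does not start with %q", name, prefix)
	}
	n, err := strconv.Atoi(strings.TrimPrefix(name, prefix))
	if err != nil {
		return 0, fmt.Errorf("directory %q has no numeric suffix: %w", name, err)
	}
	return n, nil
}

// sortByWave returns a copy of names ordered by numeric suffix, so that
// restore2 comes before restore10. The input slice is left unchanged.
func sortByWave(names []string) ([]string, error) {
	type entry struct {
		name string
		wave int
	}
	entries := make([]entry, 0, len(names))
	for _, name := range names {
		wave, err := waveIndex(name)
		if err != nil {
			return nil, err
		}
		entries = append(entries, entry{name: name, wave: wave})
	}
	sort.SliceStable(entries, func(i, j int) bool {
		return entries[i].wave < entries[j].wave
	})
	sorted := make([]string, len(entries))
	for i, e := range entries {
		sorted[i] = e.name
	}
	return sorted, nil
}
```

Returning an error for names that do not match the pattern, rather than silently skipping them, is a design choice of this sketch and keeps unexpected directory contents visible.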

      Additional info:

      This issue was discovered during IBU upgrade testing (OCP 4.18 to 4.20) with 11 restore waves. In this case it did not cause functional issues because the restores were independent, but it could lead to failures if there are dependencies between waves that rely on the execution order.

              jche@redhat.com Jun Chen
              dmunneor1@redhat.com Daniel Munne Ortega
              Yang Liu