  1. Machine Config Operator
  2. MCO-816

Graceful build failure recovery (un-wedge buildController)


    • Type: Story
    • Resolution: Unresolved
    • Priority: Major
    • OCPSTRAT-1389 - On Cluster Layering: Phase 3 (GA)
    • MCO Sprint 259, MCO Sprint 260

      Currently, when an on-cluster build fails, there is no easy way to clear the failed build status and its associated objects so that another build can be performed. In this state, a cluster admin cannot perform any further on-cluster builds for that MachineOSConfig until the build failure condition is cleared. The only way to do that today is to delete and recreate the MachineOSConfig, which is disruptive and undesirable. A less disruptive mechanism for clearing the failure and retrying the build is needed.

      Overall Flow

      1. The cluster admin adds a label / annotation (e.g., machineconfiguration.openshift.io/force-rebuild) to the MachineOSConfig.
      2. The BuildController will enter its sync loop and perform the following operations:
        1. Delete all ephemeral build objects such as ConfigMaps and / or Secrets as well as the build pods themselves.
        2. Delete the MachineOSBuild associated with the current build.
        3. Restart the build process.
      3. Once the build process has been restarted, BuildController will clear the rebuild label / annotation from the MachineOSConfig object.
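
      The flow above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the BuildController's actual code: the MachineOSConfig struct is a simplified stand-in for the real API type, and the BuildCleaner interface, the syncForceRebuild function, and the recordingCleaner fake are hypothetical names introduced here. The annotation key is the example given in step 1.

```go
package main

import "fmt"

// ForceRebuildAnnotation is the example key from the issue text; the final
// key (and whether it is a label or an annotation) is still undecided.
const ForceRebuildAnnotation = "machineconfiguration.openshift.io/force-rebuild"

// MachineOSConfig is a simplified stand-in for the real API type
// (assumption: only the fields needed for this sketch are modeled).
type MachineOSConfig struct {
	Name        string
	Annotations map[string]string
}

// BuildCleaner is a hypothetical interface covering the three operations
// the sync loop performs in step 2 of the flow.
type BuildCleaner interface {
	// DeleteEphemeralObjects removes ConfigMaps, Secrets, and build pods.
	DeleteEphemeralObjects(mosc *MachineOSConfig) error
	// DeleteMachineOSBuild removes the MachineOSBuild for the current build.
	DeleteMachineOSBuild(mosc *MachineOSConfig) error
	// StartBuild restarts the build process.
	StartBuild(mosc *MachineOSConfig) error
}

// syncForceRebuild runs the rebuild flow if the annotation is present.
// It reports whether a rebuild was started. The annotation is cleared only
// after every step succeeds, so a failed attempt is retried on the next sync.
func syncForceRebuild(mosc *MachineOSConfig, c BuildCleaner) (bool, error) {
	if _, ok := mosc.Annotations[ForceRebuildAnnotation]; !ok {
		return false, nil // nothing requested
	}
	if err := c.DeleteEphemeralObjects(mosc); err != nil {
		return false, fmt.Errorf("deleting ephemeral build objects for %s: %w", mosc.Name, err)
	}
	if err := c.DeleteMachineOSBuild(mosc); err != nil {
		return false, fmt.Errorf("deleting MachineOSBuild for %s: %w", mosc.Name, err)
	}
	if err := c.StartBuild(mosc); err != nil {
		return false, fmt.Errorf("restarting build for %s: %w", mosc.Name, err)
	}
	// Step 3: clear the marker once the build has been restarted.
	delete(mosc.Annotations, ForceRebuildAnnotation)
	return true, nil
}

// recordingCleaner is a fake BuildCleaner that records the order of calls,
// which is convenient for unit tests.
type recordingCleaner struct {
	calls []string
}

func (r *recordingCleaner) DeleteEphemeralObjects(*MachineOSConfig) error {
	r.calls = append(r.calls, "delete-ephemeral")
	return nil
}

func (r *recordingCleaner) DeleteMachineOSBuild(*MachineOSConfig) error {
	r.calls = append(r.calls, "delete-machineosbuild")
	return nil
}

func (r *recordingCleaner) StartBuild(*MachineOSConfig) error {
	r.calls = append(r.calls, "start-build")
	return nil
}
```

      Clearing the annotation last keeps the operation retryable: if any step fails, the marker stays in place and the next sync attempts the cleanup again. In a real controller the annotation removal would also have to be persisted through the API server, which this sketch does not model.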

      Implementation Details

      • Appropriate unit tests / e2e tests are written for the chosen implementation.
      • Detection of whether the build is retryable due to a failure is not in-scope for this issue.

            zzlotnik@redhat.com Zack Zlotnik
