-
Story
-
Resolution: Done
-
Major
OCPSTRAT-1389 - On Cluster Layering: Phase 3 (GA)
-
MCO Sprint 259, MCO Sprint 260, MCO Sprint 261, MCO Sprint 262
When an on-cluster build fails, there is currently no easy way to clear the failed build status and its associated objects so that another build can run. Until the build failure condition is cleared, a cluster admin cannot perform any further on-cluster builds for that MachineOSConfig. The only way to clear it today is to delete the MachineOSConfig and recreate it, which is disruptive and undesirable. A less disruptive mechanism is needed.
Overall Flow
- The cluster admin adds a label / annotation (e.g., machineconfiguration.openshift.io/force-rebuild) to the MachineOSConfig.
- The BuildController will enter its sync loop and perform the following operations:
  - Delete all ephemeral build objects such as ConfigMaps and / or Secrets as well as the build pods themselves.
  - Delete the MachineOSBuild associated with the current build.
  - Restart the build process.
- Once the build process has been restarted, BuildController will clear the rebuild label / annotation from the MachineOSConfig object.
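The flow above can be sketched as a single reconcile pass. This is a minimal in-memory model, not the BuildController's actual code: the Cluster and MachineOSConfig classes, the sync function, and the "Building" state string are hypothetical names chosen for illustration, and only the annotation key comes from this issue.

```python
from dataclasses import dataclass, field

# Annotation key as proposed in this issue; all other names are hypothetical.
FORCE_REBUILD = "machineconfiguration.openshift.io/force-rebuild"


@dataclass
class MachineOSConfig:
    name: str
    annotations: dict = field(default_factory=dict)


@dataclass
class Cluster:
    """In-memory stand-in for the cluster state a controller would observe."""
    # MachineOSConfig name -> names of ephemeral ConfigMaps / Secrets / pods
    ephemeral_objects: dict = field(default_factory=dict)
    # MachineOSConfig name -> state of its current MachineOSBuild
    builds: dict = field(default_factory=dict)


def sync(cluster: Cluster, mosc: MachineOSConfig) -> bool:
    """Run one reconcile pass; return True if a rebuild was triggered."""
    if FORCE_REBUILD not in mosc.annotations:
        return False  # nothing requested, leave existing state untouched

    # 1. Delete ephemeral build objects (ConfigMaps, Secrets, build pods).
    cluster.ephemeral_objects.pop(mosc.name, None)
    # 2. Delete the MachineOSBuild associated with the current build.
    cluster.builds.pop(mosc.name, None)
    # 3. Restart the build process.
    cluster.builds[mosc.name] = "Building"
    # 4. Clear the annotation only after the restart, so that a pass which
    #    fails partway through is retried on the next sync.
    del mosc.annotations[FORCE_REBUILD]
    return True
```

Because the annotation is removed last, a second pass over the same object is a no-op, which is the property that keeps the rebuild from looping.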
Implementation Details
- As of https://github.com/openshift/machine-config-operator/pull/4471, there are labels and annotations attached to all ephemeral build objects that identify what MachineOSConfig / MachineOSBuild / etc. they belong to as well as a machineconfiguration.openshift.io/ephemeral-build-object label that explicitly identifies an object as ephemeral. See: https://github.com/cheesesashimi/machine-config-operator/blob/9b501d90ea2cbd5bd2427bea0c7d2cc736796b1c/pkg/controller/build/constants.go for a more complete list of available labels.
- Alternatively, the preexisting machineconfiguration.openshift.io/rebuildImage label / annotation could be used. To my understanding, a regression around this label keeps it from working as it should, so we could potentially re-use it for this scenario.
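As a rough illustration of how the labels described above could drive cleanup, the sketch below filters a list of objects down to the ephemeral ones that belong to a given MachineOSConfig. Only the ephemeral-build-object key comes from this issue. OWNER_LABEL is a hypothetical placeholder (the real keys are listed in the constants.go file linked above), and the dict-based object shape is an assumption made to keep the example self-contained.

```python
# Label key named in this issue.
EPHEMERAL_LABEL = "machineconfiguration.openshift.io/ephemeral-build-object"
# Hypothetical placeholder; the real owner-label keys live in constants.go.
OWNER_LABEL = "example.com/machine-os-config"


def ephemeral_objects_for(objects: list, mosc_name: str) -> list:
    """Return names of ephemeral build objects owned by the given MachineOSConfig.

    An object qualifies only if it carries the ephemeral marker label (any
    value) and its owner label equals mosc_name.
    """
    return [
        obj["name"]
        for obj in objects
        if EPHEMERAL_LABEL in obj["labels"]
        and obj["labels"].get(OWNER_LABEL) == mosc_name
    ]


objects = [
    {"name": "build-pod", "labels": {EPHEMERAL_LABEL: "", OWNER_LABEL: "layered"}},
    {"name": "other-pod", "labels": {EPHEMERAL_LABEL: "", OWNER_LABEL: "infra"}},
    {"name": "user-secret", "labels": {OWNER_LABEL: "layered"}},
]
to_delete = ephemeral_objects_for(objects, "layered")
```

Requiring both labels means objects that merely reference the MachineOSConfig, such as a user-supplied Secret, are never selected for deletion.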
Acceptance Criteria and Scope
- Appropriate unit tests / e2e tests are written for the chosen implementation.
- Detection of whether the build is retryable due to a failure is not in-scope for this issue.
- causes: OCPBUGS-19007 OCB builds fail when several MCPs are building at the same time (Closed)