Type: Story
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: None
Labels:
- mco_qe_required

Story Points:
8
Blocked:
False
Blocked Reason:
None
Ready:
False
Epic Link:
On Cluster Layering GA
Feature Link:
OCPSTRAT-1389 - On Cluster Layering: Phase 3 (GA)
Intelligence Requested:
Market:

Sprint:
MCO Sprint 259, MCO Sprint 260, MCO Sprint 261, MCO Sprint 262
Cost of Delay:
0
WSJF:
0.000

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

When an on-cluster build fails, there is currently no easy way to clear the failed build status and its objects so that another build can run. Until the build failure condition is cleared, a cluster admin cannot perform any further on-cluster builds for that MachineOSConfig. The only way to clear it today is to delete the MachineOSConfig and recreate it, which is disruptive and undesirable. A less disruptive mechanism should be provided instead.

Overall Flow

The cluster admin adds a label / annotation (e.g., machineconfiguration.openshift.io/force-rebuild) to the MachineOSConfig.
The BuildController will enter its sync loop and perform the following operations:
1. Delete all ephemeral build objects such as ConfigMaps and / or Secrets as well as the build pods themselves.
2. Delete the MachineOSBuild associated with the current build.
3. Restart the build process.
Once the build process has been restarted, BuildController will clear the rebuild label / annotation from the MachineOSConfig object.
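
The flow above can be sketched in Go as follows. This is a minimal illustration only, not the BuildController implementation: the `machineOSConfig` and `buildStore` types, the `syncForceRebuild` function, and the generated build names are all hypothetical in-memory stand-ins, and the annotation key is the example from this issue, which may not be the one finally chosen.

```go
package main

import "fmt"

// Annotation key taken from the example in this issue; the final key may differ.
const forceRebuildAnnotation = "machineconfiguration.openshift.io/force-rebuild"

// machineOSConfig is a hypothetical, stripped-down stand-in for the real API object.
type machineOSConfig struct {
	Name        string
	Annotations map[string]string
}

// buildStore is a hypothetical in-memory stand-in for cluster state.
type buildStore struct {
	// MachineOSConfig name -> ephemeral ConfigMaps, Secrets, and build pods.
	EphemeralObjects map[string][]string
	// MachineOSConfig name -> name of its current MachineOSBuild.
	Builds  map[string]string
	started int // number of builds started so far, used to name new builds
}

// syncForceRebuild performs one pass of the described flow and reports
// whether a rebuild was triggered.
func syncForceRebuild(cfg *machineOSConfig, store *buildStore) bool {
	if _, ok := cfg.Annotations[forceRebuildAnnotation]; !ok {
		return false // nothing requested, nothing to do
	}

	// 1. Delete all ephemeral build objects, including the build pods.
	delete(store.EphemeralObjects, cfg.Name)

	// 2. Delete the MachineOSBuild associated with the current build.
	delete(store.Builds, cfg.Name)

	// 3. Restart the build process (modeled here as recording a new build).
	store.started++
	store.Builds[cfg.Name] = fmt.Sprintf("%s-build-%d", cfg.Name, store.started)

	// Finally, clear the rebuild annotation so the next sync is a no-op.
	delete(cfg.Annotations, forceRebuildAnnotation)
	return true
}

func main() {
	cfg := &machineOSConfig{
		Name:        "worker",
		Annotations: map[string]string{forceRebuildAnnotation: ""},
	}
	store := &buildStore{
		EphemeralObjects: map[string][]string{"worker": {"configmap/containerfile", "pod/build-worker"}},
		Builds:           map[string]string{"worker": "worker-build-failed"},
	}

	fmt.Println("rebuild triggered:", syncForceRebuild(cfg, store))
	fmt.Println("current build:", store.Builds[cfg.Name])
	fmt.Println("rebuild triggered on second sync:", syncForceRebuild(cfg, store))
}
```

Clearing the annotation only after the new build has been recorded keeps the operation retryable: if the sync is interrupted partway through, the annotation is still present and the next pass repeats the cleanup.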

Implementation Details

As of https://github.com/openshift/machine-config-operator/pull/4471, all ephemeral build objects carry labels and annotations that identify which MachineOSConfig, MachineOSBuild, etc. they belong to, as well as a machineconfiguration.openshift.io/ephemeral-build-object label that explicitly marks an object as ephemeral. See https://github.com/cheesesashimi/machine-config-operator/blob/9b501d90ea2cbd5bd2427bea0c7d2cc736796b1c/pkg/controller/build/constants.go for a more complete list of available labels.
There is also a preexisting machineconfiguration.openshift.io/rebuildImage label / annotation. To my understanding, a regression currently prevents it from working as it should, so we could potentially re-use it for this scenario rather than introducing a new one.

Implementation Details

Appropriate unit tests / e2e tests are written for the chosen implementation.
Detection of whether the build is retryable due to a failure is not in-scope for this issue.

causes

OCPBUGS-19007 OCB builds fail when several MCPs are building at the same time

Closed

links to

openshift/machine-config-operator#4624: MCO-1327: MCO-756: MCO-1166: MCO-816: Make BuildController less monolithic

Assignee: Zack Zlotnik

Reporter: Zack Zlotnik

QA Contact: Sergio Regidor de la Rosa

Votes: 0

Watchers: 5

Created: 2023/10/11 3:36 PM

Updated: 2024/11/18 7:40 PM

Resolved: 2024/11/18 7:40 PM
