-
Bug
-
Resolution: Done-Errata
-
Normal
-
None
-
4.14.0
-
+
-
Moderate
-
No
-
MCO Sprint 249, MCO Sprint 250, MCO Sprint 251, MCO Sprint 252
-
4
-
False
-
Description of problem:
When several MachineConfigPools in a cluster have the on-cluster-build functionality enabled and build images at the same time, some of those builds fail with status "Error (BuildPodDeleted)".
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-09-12-195514   True        False         6h21m   Cluster version is 4.14.0-0.nightly-2023-09-12-195514
How reproducible:
Always
Steps to Reproduce:
1. Create the configuration resources needed by the OCB functionality. To reproduce this issue we use an on-cluster-build-config configmap with an empty imageBuilderType

oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": ""}}'

2. Create 5 custom pools

for n in {1..5}
do
    echo $n
    cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra$n
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra$n]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra$n: ""
EOF
done

3. Label the pools to enable the OCB functionality

for n in {1..5}
do
    echo $n
    oc label mcp/infra$n machineconfiguration.openshift.io/layering-enabled=
done

4. Wait for the builds to finish. The builds should finish OK.

5. Create a MC to trigger another build. This one, for example:

cat << EOF | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-machine-config
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,dGVzdA==
        filesystem: root
        mode: 420
        path: /etc/my-test-file.test
EOF
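Step 4 ("Wait for the builds to finish") can be scripted instead of watched by hand. The following is only a sketch: the function names and the 10-second polling interval are hypothetical, and it assumes the builds live in the openshift-machine-config-operator namespace and that the fourth column of the default "oc get builds" table is the first word of the build status (New, Pending, Running, Complete, Error, ...).

```shell
# in_progress_builds: reads `oc get builds` table output on stdin and prints
# the names of builds whose status starts with New, Pending or Running.
# Assumption: column 4 is the first word of STATUS, as in the default table.
in_progress_builds() {
  awk '$4 == "New" || $4 == "Pending" || $4 == "Running" { print $1 }'
}

# wait_for_builds: polls until no build is in progress.
# Assumption: builds are created in openshift-machine-config-operator.
wait_for_builds() {
  while :; do
    out=$(oc get builds -n openshift-machine-config-operator) || return 1
    if [ -z "$(printf '%s\n' "$out" | in_progress_builds)" ]; then
      return 0
    fi
    sleep 10  # hypothetical polling interval
  done
}
```

Calling wait_for_builds between steps 3 and 5 returns once every build has left the New, Pending and Running states; it does not check whether the builds succeeded.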
Actual results:
The new builds are triggered, but some of the pods are terminated before they can finish. The affected builds fail with "Error (BuildPodDeleted)".

NAME                                                               READY   STATUS    RESTARTS   AGE
pod/build-rendered-infra1-fc68772b20de56ea566bb8f81a53e3d1-build   1/1     Running   0          25s
pod/build-rendered-infra4-fc68772b20de56ea566bb8f81a53e3d1-build   1/1     Running   0          22s
pod/build-rendered-infra5-fc68772b20de56ea566bb8f81a53e3d1-build   1/1     Running   0          20s
pod/machine-config-controller-5bdd7b66c5-dl4hh                     2/2     Running   0          6h48m
pod/machine-config-daemon-5wbw4                                    2/2     Running   0          6h48m
pod/machine-config-daemon-fqr8x                                    2/2     Running   0          6h48m
pod/machine-config-daemon-g77zd                                    2/2     Running   12         6h41m
pod/machine-config-daemon-qzmvv                                    2/2     Running   20         6h41m
pod/machine-config-daemon-w8mnz                                    2/2     Running   0          6h48m
pod/machine-config-operator-7dd564556d-mqc5w                       2/2     Running   0          6h50m
pod/machine-config-server-28lnp                                    1/1     Running   0          6h47m
pod/machine-config-server-5csjz                                    1/1     Running   0          6h47m
pod/machine-config-server-fv4vk                                    1/1     Running   0          6h47m
pod/machine-os-builder-6cfbd8d5d-pbdz5                             1/1     Running   0          4m19s

NAME                                                                              TYPE     FROM         STATUS                    STARTED          DURATION
build.build.openshift.io/build-rendered-infra1-fc68772b20de56ea566bb8f81a53e3d1   Docker   Dockerfile   Running                   25 seconds ago
build.build.openshift.io/build-rendered-infra2-fc68772b20de56ea566bb8f81a53e3d1   Docker   Dockerfile   Error (BuildPodDeleted)   25 seconds ago   12s
build.build.openshift.io/build-rendered-infra3-fc68772b20de56ea566bb8f81a53e3d1   Docker   Dockerfile   Error (BuildPodDeleted)   23 seconds ago   13s
build.build.openshift.io/build-rendered-infra4-fc68772b20de56ea566bb8f81a53e3d1   Docker   Dockerfile   Running                   22 seconds ago
build.build.openshift.io/build-rendered-infra5-fc68772b20de56ea566bb8f81a53e3d1   Docker   Dockerfile   Running                   20 seconds ago
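To check for this failure after each reproduction attempt, the build listing can be filtered for the affected status. This is a minimal sketch: the function name is hypothetical, and it assumes the table layout shown in the output above, where the STATUS column reads "Error (BuildPodDeleted)" and starts at the fourth whitespace-separated field.

```shell
# deleted_pod_builds: reads `oc get builds` (or `oc get pods,builds`) table
# output on stdin and prints the names of builds whose status is
# "Error (BuildPodDeleted)".
# Assumption: fields 4 and 5 hold the two words of that status.
deleted_pod_builds() {
  awk '$4 == "Error" && $5 == "(BuildPodDeleted)" { print $1 }'
}

# Example use (namespace is an assumption based on the pods listed above):
#   oc get builds -n openshift-machine-config-operator | deleted_pod_builds
```

An empty result means no build in the listing hit this error; any printed name identifies a build whose pod was deleted before it finished.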
Expected results:
The builds should not fail.
Additional info:
A link to the must-gather file is in the first comment of this Jira ticket.
- is caused by
-
MCO-816 Graceful build failure recovery (un-wedge buildController)
- In Progress
- relates to
-
OCPBUGS-21647 When OCB is enabled, a node is drained before completing the image build
- New
- links to
-
RHBA-2024:4156 OpenShift Container Platform 4.16.z bug fix update