- Bug
- Resolution: Won't Do
- Normal
- None
- 4.15.0
- Important
- No
- Rejected
- False
Description of problem:
In a cluster with the "On Cluster Build" (OCB) functionality enabled, when a MachineConfig (MC) is created and a build is triggered, a node is drained before the build is completed. When we create an MC, a build pod is created, and we can see the Machine Config Controller executing a drain operation on a node. If the build pod is running on the same node that is being drained, the build fails because the pod is deleted.
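One way to watch the race as it happens (a sketch, not part of the original reproduction; it only assumes the namespace and node names already used in this report):

$ oc get pods -n openshift-machine-config-operator -o wide -w   # NODE column shows where the build pod is scheduled
$ oc get nodes -w                                               # the drained node flips to Ready,SchedulingDisabled

If the build pod's node is the one that becomes SchedulingDisabled, the eviction that follows the cordon deletes the build pod.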
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Enable the OCB functionality:

cat << EOF | oc create -f -
apiVersion: v1
data:
  baseImagePullSecretName: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
  finalImagePushSecretName: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
  finalImagePullspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image"
  imageBuilderType: ""
kind: ConfigMap
metadata:
  name: on-cluster-build-config
  namespace: openshift-machine-config-operator
EOF

oc label mcp/worker machineconfiguration.openshift.io/layering-enabled=

2. Wait for the build of the current rendered MC to be completed.
3. Once the current rendered MC build is completed, create any MC and track the controller logs:

oc logs -l k8s-app=machine-config-controller -f
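To verify step 2 before moving on, something like the following can be used (a sketch; it assumes the Build objects live in the MCO namespace, consistent with the machine-os-builder pod shown in the logs below):

$ oc get builds -n openshift-machine-config-operator -w   # wait for STATUS to reach Complete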
Actual results:
After creating the MC, we can see in the controller log that a drain operation starts, and that the drained node is reported as not schedulable:

I1016 11:05:16.522820 1 drain_controller.go:173] node ip-10-0-12-65.us-east-2.compute.internal: cordoning
I1016 11:05:16.522882 1 drain_controller.go:173] node ip-10-0-12-65.us-east-2.compute.internal: initiating cordon (currently schedulable: true)
I1016 11:05:16.542051 1 drain_controller.go:173] node ip-10-0-12-65.us-east-2.compute.internal: cordon succeeded (currently schedulable: false)
I1016 11:05:16.542071 1 drain_controller.go:173] node ip-10-0-12-65.us-east-2.compute.internal: initiating drain
I1016 11:05:16.588650 1 node_controller.go:493] Pool worker[zone=us-east-2a]: node ip-10-0-12-65.us-east-2.compute.internal: changed taints
E1016 11:05:17.184185 1 drain_controller.go:144] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-4z2jl, openshift-cluster-node-tuning-operator/tuned-wlcts, openshift-dns/dns-default-842r9, openshift-dns/node-resolver-6x9vz, openshift-image-registry/node-ca-5vj8x, openshift-ingress-canary/ingress-canary-rvv5h, openshift-machine-config-operator/machine-config-daemon-lkkbd, openshift-monitoring/node-exporter-k6z2h, openshift-multus/multus-additional-cni-plugins-7h6w9, openshift-multus/multus-mn8rg, openshift-multus/network-metrics-daemon-cs7ck, openshift-network-diagnostics/network-check-target-27zcn, openshift-sdn/sdn-95zb7
I1016 11:05:17.185759 1 drain_controller.go:144] evicting pod openshift-operator-lifecycle-manager/collect-profiles-28290900-8gglv
I1016 11:05:17.185772 1 drain_controller.go:144] evicting pod openshift-image-registry/image-registry-5b869dd868-scml4
I1016 11:05:17.185782 1 drain_controller.go:144] evicting pod openshift-machine-config-operator/machine-os-builder-557cd99c48-xpf6m
I1016 11:05:17.185818 1 drain_controller.go:144] evicting pod openshift-monitoring/openshift-state-metrics-676ff5979f-mzq8k

$ oc get nodes
NAME                                        STATUS                     ROLES                  AGE     VERSION
ip-10-0-12-65.us-east-2.compute.internal    Ready,SchedulingDisabled   worker                 48m     v1.28.2+481304a
ip-10-0-25-206.us-east-2.compute.internal   Ready                      control-plane,master   3h19m   v1.28.2+481304a
ip-10-0-40-118.us-east-2.compute.internal   Ready                      control-plane,master   3h19m   v1.28.2+481304a
ip-10-0-62-201.us-east-2.compute.internal   Ready                      worker                 3h12m   v1.28.2+481304a
ip-10-0-74-243.us-east-2.compute.internal   Ready                      control-plane,master   3h17m   v1.28.2+481304a

We can see these logs in the drained node:

I1016 10:52:19.511842 1452 daemon.go:670] Transitioned from state: Working -> Done
I1016 11:05:14.407513 1452 rpm-ostree.go:308] Running captured: rpm-ostree kargs
I1016 11:05:14.640528 1452 daemon.go:760] Preflight config drift check successful (took 423.738143ms)
I1016 11:05:14.646814 1452 config_drift_monitor.go:255] Config Drift Monitor has shut down
I1016 11:05:14.646831 1452 daemon.go:2157] Performing layered OS update
I1016 11:05:14.661712 1452 update.go:1977] Starting transition from "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image@sha256:6498f333a29f6d3943e852b039e77d894e47ee29b0212af38cf120ca7c4b85ce" to "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image@sha256:6498f333a29f6d3943e852b039e77d894e47ee29b0212af38cf120ca7c4b85ce"
I1016 11:05:14.665144 1452 update.go:1977] Update prepared; requesting cordon and drain via annotation to controller

A "layered OS update" operation was executed on the node immediately after the build pod was created. In this case the build pod was running on the same node that was drained, so the build failed. The outcome is mostly random: the build can succeed if the build pod is scheduled on a different node.

$ oc get build
NAME                                                     TYPE     FROM         STATUS                    STARTED         DURATION
build-rendered-worker-5620288caa936354c33583d4d2a896dc   Docker   Dockerfile   Error (BuildPodDeleted)   3 minutes ago   10s
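To confirm that the failed build pod was co-located with the drained node, something like the following can be run while the build is still in progress, before the eviction removes the pod (a sketch; openshift.io/build.name is the label the OpenShift build controller sets on build pods, and the build name is taken from the output above):

$ oc get pods -n openshift-machine-config-operator -l openshift.io/build.name=build-rendered-worker-5620288caa936354c33583d4d2a896dc -o wide

The NODE column should show ip-10-0-12-65.us-east-2.compute.internal, the node that was later cordoned and drained.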
Expected results:
No drain operation should be triggered on any node when the build pod is created.
Additional info:
- is incorporated by MCO-1231 "Use Kubernetes Job objects for image builds" (Closed)
- is related to OCPBUGS-19007 "OCB builds fail when several MCPs are building at the same time" (Closed)
- relates to MCO-665 "On-Cluster Layering Tech Preview" (Closed)