OCPBUGS-21647: When OCB is enabled, a node is drained before completing the image build


      Description of problem:

      In a cluster with the "On Cluster Build" (OCB) functionality enabled, when a MC is created and a build is triggered, a node is drained before the build is completed.
      
      When we create a MC, a build pod is created, and right away we can see the Machine Config Controller executing a drain operation on a node.
      
      If the build pod is scheduled on the same node that is being drained, the build fails because the pod is deleted.
      
       

      Version-Release number of selected component (if applicable):

      4.15
       

      How reproducible:

      Always
       

      Steps to Reproduce:

      1. Enable OCB functionality
      cat << EOF | oc create -f -
      apiVersion: v1
      data:
        baseImagePullSecretName: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
        finalImagePushSecretName: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
        finalImagePullspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image"
        imageBuilderType: ""
      kind: ConfigMap
      metadata:
        name: on-cluster-build-config
        namespace: openshift-machine-config-operator
      EOF
      
      oc label mcp/worker machineconfiguration.openshift.io/layering-enabled=
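      
      To verify that the OCB functionality is enabled, we can check that the machine-os-builder pod is running in the MCO namespace (an illustrative check; the pod name is taken from the controller logs reported in the actual results):
      
      oc -n openshift-machine-config-operator get pods | grep machine-os-builder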
      
      
      2. Wait for the build of the current rendered MC to be completed (see the tracking commands below).
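      
      A sketch of how to track the build; it assumes the Build object and the build pod are created in the openshift-machine-config-operator namespace, as in the outputs reported in the actual results:
      
      # Watch the Build objects created by the machine-os-builder
      oc -n openshift-machine-config-operator get builds -w
      
      # Alternatively, watch the build pod directly
      oc -n openshift-machine-config-operator get pods -w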
      
      
      3. Once the current rendered machine config build is completed, create any MC (a minimal example is shown below).
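      
      For instance, a minimal MachineConfig like the following is enough to trigger a new rendered config and a new build; the MC name and file path are just illustrative:
      
      cat << EOF | oc create -f -
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker
        name: 99-worker-test-ocb-drain
      spec:
        config:
          ignition:
            version: 3.2.0
          storage:
            files:
              - contents:
                  source: "data:,test"
                mode: 420
                path: /etc/test-ocb-drain
      EOF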
      
      Track the controller logs:
      
       oc logs -l k8s-app=machine-config-controller -f
      
      
      
      

      Actual results:

      After creating the MC, we can see in the controller logs that a drain operation is started, and the drained node is reported as SchedulingDisabled:
      
      
      I1016 11:05:16.522820       1 drain_controller.go:173] node ip-10-0-12-65.us-east-2.compute.internal: cordoning
      I1016 11:05:16.522882       1 drain_controller.go:173] node ip-10-0-12-65.us-east-2.compute.internal: initiating cordon (currently schedulable: true)
      I1016 11:05:16.542051       1 drain_controller.go:173] node ip-10-0-12-65.us-east-2.compute.internal: cordon succeeded (currently schedulable: false)
      I1016 11:05:16.542071       1 drain_controller.go:173] node ip-10-0-12-65.us-east-2.compute.internal: initiating drain
      I1016 11:05:16.588650       1 node_controller.go:493] Pool worker[zone=us-east-2a]: node ip-10-0-12-65.us-east-2.compute.internal: changed taints
      E1016 11:05:17.184185       1 drain_controller.go:144] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-4z2jl, openshift-cluster-node-tuning-operator/tuned-wlcts, openshift-dns/dns-default-842r9, openshift-dns/node-resolver-6x9vz, openshift-image-registry/node-ca-5vj8x, openshift-ingress-canary/ingress-canary-rvv5h, openshift-machine-config-operator/machine-config-daemon-lkkbd, openshift-monitoring/node-exporter-k6z2h, openshift-multus/multus-additional-cni-plugins-7h6w9, openshift-multus/multus-mn8rg, openshift-multus/network-metrics-daemon-cs7ck, openshift-network-diagnostics/network-check-target-27zcn, openshift-sdn/sdn-95zb7
      I1016 11:05:17.185759       1 drain_controller.go:144] evicting pod openshift-operator-lifecycle-manager/collect-profiles-28290900-8gglv
      I1016 11:05:17.185772       1 drain_controller.go:144] evicting pod openshift-image-registry/image-registry-5b869dd868-scml4
      I1016 11:05:17.185782       1 drain_controller.go:144] evicting pod openshift-machine-config-operator/machine-os-builder-557cd99c48-xpf6m
      I1016 11:05:17.185818       1 drain_controller.go:144] evicting pod openshift-monitoring/openshift-state-metrics-676ff5979f-mzq8k
      
      $ oc get nodes
      NAME                                        STATUS                     ROLES                  AGE     VERSION
      ip-10-0-12-65.us-east-2.compute.internal    Ready,SchedulingDisabled   worker                 48m     v1.28.2+481304a
      ip-10-0-25-206.us-east-2.compute.internal   Ready                      control-plane,master   3h19m   v1.28.2+481304a
      ip-10-0-40-118.us-east-2.compute.internal   Ready                      control-plane,master   3h19m   v1.28.2+481304a
      ip-10-0-62-201.us-east-2.compute.internal   Ready                      worker                 3h12m   v1.28.2+481304a
      ip-10-0-74-243.us-east-2.compute.internal   Ready                      control-plane,master   3h17m   v1.28.2+481304a
      
      We can see these logs from the machine-config-daemon on the drained node:
      
      I1016 10:52:19.511842    1452 daemon.go:670] Transitioned from state: Working -> Done
      I1016 11:05:14.407513    1452 rpm-ostree.go:308] Running captured: rpm-ostree kargs
      I1016 11:05:14.640528    1452 daemon.go:760] Preflight config drift check successful (took 423.738143ms)
      I1016 11:05:14.646814    1452 config_drift_monitor.go:255] Config Drift Monitor has shut down
      I1016 11:05:14.646831    1452 daemon.go:2157] Performing layered OS update
      I1016 11:05:14.661712    1452 update.go:1977] Starting transition from "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image@sha256:6498f333a29f6d3943e852b039e77d894e47ee29b0212af38cf120ca7c4b85ce" to "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image@sha256:6498f333a29f6d3943e852b039e77d894e47ee29b0212af38cf120ca7c4b85ce"
      I1016 11:05:14.665144    1452 update.go:1977] Update prepared; requesting cordon and drain via annotation to controller
      
      
      
      A "layered OS update" operation has been executed in the node immediately after creating the build pod.
      
      
      In this case the build pod was scheduled on the same node that was drained, so the build failed. Whether this happens is mostly random: the build can succeed if the build pod is scheduled on a different node.
      
      $ oc get build
      NAME                                                     TYPE     FROM         STATUS                    STARTED         DURATION
      build-rendered-worker-5620288caa936354c33583d4d2a896dc   Docker   Dockerfile   Error (BuildPodDeleted)   3 minutes ago   10s
      
      
       

      Expected results:

      No drain operation should be triggered on any node when the build pod is created.
      
       

      Additional info:

       

            Team: MCO
            Assignee: Sergio Regidor de la Rosa (sregidor@redhat.com)
            Reporter: Sergio Regidor de la Rosa