OCPBUGS-21647: When OCB is enabled, a node is drained before completing the image build


      Description of problem:

      In a cluster with the "On Cluster Build" (OCB) functionality enabled, when a MC is created and a build is triggered, a node is drained before the build is completed.
      
      When we create a MC, a build pod is created, and right away we can see the Machine Config Controller executing a drain operation on a node.
      
      If the build pod is scheduled on the same node that is being drained, the build fails because the pod is deleted.
      
       

      Version-Release number of selected component (if applicable):

      4.15
       

      How reproducible:

      Always
       

      Steps to Reproduce:

      1. Enable OCB functionality
      cat << EOF | oc create -f -
      apiVersion: v1
      data:
        baseImagePullSecretName: $(oc get secret -n openshift-config pull-secret -o json | jq "del(.metadata.namespace, .metadata.creationTimestamp, .metadata.resourceVersion, .metadata.uid, .metadata.name)" | jq '.metadata.name="pull-copy"' | oc -n openshift-machine-config-operator create -f - &> /dev/null; echo -n "pull-copy")
        finalImagePushSecretName: $(oc get -n openshift-machine-config-operator sa builder -ojsonpath='{.secrets[0].name}')
        finalImagePullspec: "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image"
        imageBuilderType: ""
      kind: ConfigMap
      metadata:
        name: on-cluster-build-config
        namespace: openshift-machine-config-operator
      EOF
      
      oc label mcp/worker machineconfiguration.openshift.io/layering-enabled=
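      
      To verify that the OCB functionality is enabled, we can check that the machine-os-builder pod is running in the MCO namespace (an illustrative check; the pod name is taken from the controller logs reported in the actual results):
      
      oc -n openshift-machine-config-operator get pods | grep machine-os-builder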
      
      
      2. Wait for the build of the current rendered MC to be completed (see the tracking commands below).
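      
      A sketch of how to track the build; it assumes the Build object and the build pod are created in the openshift-machine-config-operator namespace, as in the outputs reported in the actual results:
      
      # Watch the Build objects created by the machine-os-builder
      oc -n openshift-machine-config-operator get builds -w
      
      # Alternatively, watch the build pod directly
      oc -n openshift-machine-config-operator get pods -w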
      
      
      3. Once the current rendered machine config build is completed, create any MC (a minimal example is shown below).
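      
      For instance, a minimal MachineConfig like the following is enough to trigger a new rendered config and a new build; the MC name and file path are just illustrative:
      
      cat << EOF | oc create -f -
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      metadata:
        labels:
          machineconfiguration.openshift.io/role: worker
        name: 99-worker-test-ocb-drain
      spec:
        config:
          ignition:
            version: 3.2.0
          storage:
            files:
              - contents:
                  source: "data:,test"
                mode: 420
                path: /etc/test-ocb-drain
      EOF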
      
      Track the controller logs:
      
       oc logs -l k8s-app=machine-config-controller -f
      
      
      
      

      Actual results:

      After creating the MC, we can see in the controller logs that a drain operation is started, and the drained node is reported as SchedulingDisabled:
      
      
      I1016 11:05:16.522820       1 drain_controller.go:173] node ip-10-0-12-65.us-east-2.compute.internal: cordoning
      I1016 11:05:16.522882       1 drain_controller.go:173] node ip-10-0-12-65.us-east-2.compute.internal: initiating cordon (currently schedulable: true)
      I1016 11:05:16.542051       1 drain_controller.go:173] node ip-10-0-12-65.us-east-2.compute.internal: cordon succeeded (currently schedulable: false)
      I1016 11:05:16.542071       1 drain_controller.go:173] node ip-10-0-12-65.us-east-2.compute.internal: initiating drain
      I1016 11:05:16.588650       1 node_controller.go:493] Pool worker[zone=us-east-2a]: node ip-10-0-12-65.us-east-2.compute.internal: changed taints
      E1016 11:05:17.184185       1 drain_controller.go:144] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-4z2jl, openshift-cluster-node-tuning-operator/tuned-wlcts, openshift-dns/dns-default-842r9, openshift-dns/node-resolver-6x9vz, openshift-image-registry/node-ca-5vj8x, openshift-ingress-canary/ingress-canary-rvv5h, openshift-machine-config-operator/machine-config-daemon-lkkbd, openshift-monitoring/node-exporter-k6z2h, openshift-multus/multus-additional-cni-plugins-7h6w9, openshift-multus/multus-mn8rg, openshift-multus/network-metrics-daemon-cs7ck, openshift-network-diagnostics/network-check-target-27zcn, openshift-sdn/sdn-95zb7
      I1016 11:05:17.185759       1 drain_controller.go:144] evicting pod openshift-operator-lifecycle-manager/collect-profiles-28290900-8gglv
      I1016 11:05:17.185772       1 drain_controller.go:144] evicting pod openshift-image-registry/image-registry-5b869dd868-scml4
      I1016 11:05:17.185782       1 drain_controller.go:144] evicting pod openshift-machine-config-operator/machine-os-builder-557cd99c48-xpf6m
      I1016 11:05:17.185818       1 drain_controller.go:144] evicting pod openshift-monitoring/openshift-state-metrics-676ff5979f-mzq8k
      
      $ oc get nodes
      NAME                                        STATUS                     ROLES                  AGE     VERSION
      ip-10-0-12-65.us-east-2.compute.internal    Ready,SchedulingDisabled   worker                 48m     v1.28.2+481304a
      ip-10-0-25-206.us-east-2.compute.internal   Ready                      control-plane,master   3h19m   v1.28.2+481304a
      ip-10-0-40-118.us-east-2.compute.internal   Ready                      control-plane,master   3h19m   v1.28.2+481304a
      ip-10-0-62-201.us-east-2.compute.internal   Ready                      worker                 3h12m   v1.28.2+481304a
      ip-10-0-74-243.us-east-2.compute.internal   Ready                      control-plane,master   3h17m   v1.28.2+481304a
      
      We can see these logs from the machine-config-daemon on the drained node:
      
      I1016 10:52:19.511842    1452 daemon.go:670] Transitioned from state: Working -> Done
      I1016 11:05:14.407513    1452 rpm-ostree.go:308] Running captured: rpm-ostree kargs
      I1016 11:05:14.640528    1452 daemon.go:760] Preflight config drift check successful (took 423.738143ms)
      I1016 11:05:14.646814    1452 config_drift_monitor.go:255] Config Drift Monitor has shut down
      I1016 11:05:14.646831    1452 daemon.go:2157] Performing layered OS update
      I1016 11:05:14.661712    1452 update.go:1977] Starting transition from "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image@sha256:6498f333a29f6d3943e852b039e77d894e47ee29b0212af38cf120ca7c4b85ce" to "image-registry.openshift-image-registry.svc:5000/openshift-machine-config-operator/ocb-image@sha256:6498f333a29f6d3943e852b039e77d894e47ee29b0212af38cf120ca7c4b85ce"
      I1016 11:05:14.665144    1452 update.go:1977] Update prepared; requesting cordon and drain via annotation to controller
      
      
      
      A "layered OS update" operation has been executed in the node immediately after creating the build pod.
      
      
      In this case the build pod was scheduled on the same node that was drained, so the build failed. Whether this happens is mostly random: the build can succeed if the build pod is scheduled on a different node.
      
      $ oc get build
      NAME                                                     TYPE     FROM         STATUS                    STARTED         DURATION
      build-rendered-worker-5620288caa936354c33583d4d2a896dc   Docker   Dockerfile   Error (BuildPodDeleted)   3 minutes ago   10s
      
      
       

      Expected results:

      No drain operation should be triggered on any node when the build pod is created.
      
       

      Additional info:

       

            Team: MCO
            Assignee: Sergio Regidor de la Rosa (sregidor@redhat.com)
            Reporter: Sergio Regidor de la Rosa