  OpenShift Bugs / OCPBUGS-35336

Node drain during upgrade failed when hosted clusters are present


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version: 4.16.0
    • Severity: Moderate

      Description of problem:

      Upgrade from OCP 4.16.0-ec.6 to 4.17.0-0.nightly-2024-06-07-045541 did not complete because draining a worker node failed with the PodDisruptionBudget eviction errors shown below.
      
      % oc logs -c machine-config-controller -n openshift-machine-config-operator machine-config-controller-55f757ff9f-vxptz --tail 50
      E0612 06:34:20.593438       1 drain_controller.go:152] error when evicting pods/"openshift-apiserver-75d6669f47-ws7rh" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.593514       1 drain_controller.go:152] error when evicting pods/"etcd-0" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.593529       1 drain_controller.go:152] error when evicting pods/"kube-apiserver-758f4c5fd9-gkvcq" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.593529       1 drain_controller.go:152] error when evicting pods/"openshift-oauth-apiserver-67bcbc4d6-sgqrk" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.593972       1 drain_controller.go:152] error when evicting pods/"kube-apiserver-7d6894d9bf-ct8pn" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.594014       1 drain_controller.go:152] error when evicting pods/"openshift-apiserver-7459fd76f5-ccwr4" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.594040       1 drain_controller.go:152] error when evicting pods/"oauth-openshift-6c87cd45-v6rz4" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.594067       1 drain_controller.go:152] error when evicting pods/"oauth-openshift-fbdc777b4-sndxs" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.596459       1 drain_controller.go:152] error when evicting pods/"openshift-oauth-apiserver-59d46c946c-pntsx" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.597398       1 drain_controller.go:152] error when evicting pods/"etcd-0" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0612 06:34:25.594426       1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/oauth-openshift-fbdc777b4-sndxs
      I0612 06:34:25.594457       1 drain_controller.go:152] evicting pod bm1-ag1-bm1-ag1/etcd-0
      I0612 06:34:25.594467       1 drain_controller.go:152] evicting pod bm1-ag1-bm1-ag1/kube-apiserver-7d6894d9bf-ct8pn
      I0612 06:34:25.594484       1 drain_controller.go:152] evicting pod bm1-ag1-bm1-ag1/oauth-openshift-6c87cd45-v6rz4
      I0612 06:34:25.594486       1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/kube-apiserver-758f4c5fd9-gkvcq
      I0612 06:34:25.594456       1 drain_controller.go:152] evicting pod bm1-ag1-bm1-ag1/openshift-oauth-apiserver-67bcbc4d6-sgqrk
      I0612 06:34:25.594472       1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/openshift-apiserver-7459fd76f5-ccwr4
      I0612 06:34:25.594513       1 drain_controller.go:152] evicting pod bm1-ag1-bm1-ag1/openshift-apiserver-75d6669f47-ws7rh
      I0612 06:34:25.596735       1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/openshift-oauth-apiserver-59d46c946c-pntsx
      I0612 06:34:25.597824       1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/etcd-0
      E0612 06:34:25.607568       1 drain_controller.go:152] error when evicting pods/"openshift-apiserver-7459fd76f5-ccwr4" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.607611       1 drain_controller.go:152] error when evicting pods/"kube-apiserver-758f4c5fd9-gkvcq" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.608161       1 drain_controller.go:152] error when evicting pods/"openshift-apiserver-75d6669f47-ws7rh" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.608177       1 drain_controller.go:152] error when evicting pods/"openshift-oauth-apiserver-59d46c946c-pntsx" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.608212       1 drain_controller.go:152] error when evicting pods/"kube-apiserver-7d6894d9bf-ct8pn" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.608217       1 drain_controller.go:152] error when evicting pods/"oauth-openshift-6c87cd45-v6rz4" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.608229       1 drain_controller.go:152] error when evicting pods/"openshift-oauth-apiserver-67bcbc4d6-sgqrk" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.608962       1 drain_controller.go:152] error when evicting pods/"oauth-openshift-fbdc777b4-sndxs" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.609089       1 drain_controller.go:152] error when evicting pods/"etcd-0" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.609349       1 drain_controller.go:152] error when evicting pods/"etcd-0" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0612 06:34:30.607984       1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/kube-apiserver-758f4c5fd9-gkvcq
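
      The evictions above are being refused by PodDisruptionBudgets owned by the hosted control planes. One way to confirm this (a sketch; the exact PDB names depend on the hosted cluster) is to list the PDBs in the affected hosted control plane namespaces and check how many disruptions they currently allow:

      # List the PodDisruptionBudgets in the hosted control plane namespaces.
      # ALLOWED DISRUPTIONS of 0 means every eviction request will be refused.
      % oc get pdb -n clusters-hcp415-bm3-a
      % oc get pdb -n bm1-ag1-bm1-ag1

      # Inspect one PDB to see its minAvailable/maxUnavailable and pod selector.
      # <pdb-name> is a placeholder.
      % oc get pdb <pdb-name> -n clusters-hcp415-bm3-a -o yaml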
      
      
      % oc get nodes
      NAME            STATUS                     ROLES                         AGE   VERSION
      baremetal1-01   Ready                      control-plane,master,worker   26d   v1.30.1+9798e19
      baremetal1-02   Ready                      control-plane,master,worker   26d   v1.30.1+9798e19
      baremetal1-03   Ready                      control-plane,master,worker   26d   v1.30.1+9798e19
      baremetal1-04   Ready                      worker                        26d   v1.29.4+d9d4530
      baremetal1-05   Ready                      worker                        26d   v1.29.4+d9d4530
      baremetal1-06   Ready,SchedulingDisabled   worker                        26d   v1.29.4+d9d4530
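
      The pods still scheduled on the cordoned node can be listed to see which of them belong to hosted control plane namespaces (illustrative command, not taken from the original report):

      # Show every pod still running on the node that is being drained.
      % oc get pods -A -o wide --field-selector spec.nodeName=baremetal1-06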
      
      
      These pods belong to hosted control planes. Hosted cluster "hcp415-bm3-a" uses the KubeVirt platform and "bm1-ag1" uses the agent platform.
      
      % oc get hostedcluster -A
      NAMESPACE   NAME           VERSION       KUBECONFIG                      PROGRESS    AVAILABLE   PROGRESSING   MESSAGE
      bm1-ag1     bm1-ag1        4.16.0-ec.6   bm1-ag1-admin-kubeconfig        Completed   True        False         The hosted control plane is available
      bm1-ag2     bm1-ag2        4.16.0-ec.6   bm1-ag2-admin-kubeconfig        Completed   True        False         The hosted control plane is available
      clusters    hcp415-bm1-b   4.15.13       hcp415-bm1-b-admin-kubeconfig   Completed   True        False         The hosted control plane is available
      clusters    hcp415-bm3-a   4.15.13       hcp415-bm3-a-admin-kubeconfig   Completed   True        False         The hosted control plane is available
      
      
      This was observed while testing OpenShift Data Foundation in a Provider Mode configuration.
      
      The upgrade command used:
      % oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-07-045541 --allow-explicit-upgrade --force
      warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
      warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
      warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
      Requested update to release image registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-07-045541
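
      While the drain is stuck, the update shows as still in progress and the worker MachineConfigPool never finishes updating; for example (illustrative commands, not from the original report):

      # ClusterVersion reports the update as still progressing.
      % oc get clusterversion
      # The worker MachineConfigPool stays with UPDATED=False while the drain is blocked.
      % oc get mcp worker
      # Drain errors are also visible in the node description, in addition to the
      # machine-config-controller logs shown above.
      % oc describe node baremetal1-06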
      
      
      
      At the time of this error, the master nodes had already been drained and upgraded. Drains of the node "baremetal1-06" and of a master node (which is also a worker) were started at the same time, but because of this issue the drain of "baremetal1-06" did not complete. Earlier, one master node also had trouble evicting virt-launcher pods (e.g. virt-launcher-hcp415-bm3-a) because of insufficient memory; that was resolved by decreasing the NodePool replicas on the KubeVirt based hosted clusters.
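
      Reducing the NodePool replicas of a KubeVirt hosted cluster can be done roughly as follows; the NodePool name and replica count here are placeholders, not the exact values used in this environment:

      # Reduce the number of worker VMs backing a KubeVirt hosted cluster.
      % oc scale nodepool <nodepool-name> -n clusters --replicas=2
      # Verify the new desired/current replica counts.
      % oc get nodepool -n clusters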
      
      
      Workaround tried:
      Manually delete the pods that failed to evict. This allowed the node drain to complete. The same workaround was applied while draining the other two worker nodes. Although this resolved the problem, the impact of deleting these pods manually was not verified during the process. The hosted clusters stayed connected after the upgrade.
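
      In practice the workaround amounts to deleting the stuck pods directly, which bypasses the PodDisruptionBudget check that eviction honours; a sketch, using pod names from the drain log above (deleting them one at a time limits the disruption):

      # Deleting (rather than evicting) a pod is not subject to the PDB,
      # so the drain can make progress.
      % oc delete pod etcd-0 -n clusters-hcp415-bm3-a
      % oc delete pod kube-apiserver-758f4c5fd9-gkvcq -n clusters-hcp415-bm3-a
      % oc delete pod etcd-0 -n bm1-ag1-bm1-ag1
      # ...and so on for the remaining pods reported in the drain errors.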
      
      
      Must-gather logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/upgrade-drain-failure-must-gather/ (Red Hat network required). Adding the link because the option to browse files for attachments is not available to me.
      The link contains must-gather logs collected with the Virtualization and ACM must-gather images in addition to the origin must-gather.

       

       

      Version-Release number of selected component (if applicable):

      OCP 4.16.0-ec.6 (during upgrade to 4.17.0-0.nightly-2024-06-07-045541)
      OpenShift Virtualization 4.15.1 (the 4.15 version is installed because of other issues encountered while setting up the cluster with the unreleased 4.16 version)
      ACM 2.10.3

      How reproducible:

      Reporting the first occurrence.

      Steps to Reproduce:

      1. Prepare a cluster with the following configuration: management cluster OCP 4.16.0-ec.6, CNV 4.15.1, metallb-operator.v4.16.0-202405161711, MCE 2.5.3, advanced-cluster-management.v2.10.3, hosted OCP cluster 4.15.13-x86_64, with master nodes schedulable.
      2. Create hosted clusters using the kubevirt and agent platforms (example commands are shown after this list).
      3. Upgrade the cluster to 4.17.0 nightly builds.
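
      For reference, hosted clusters of the two platform types in step 2 can be created with the hcp CLI along these lines (cluster names and release version are taken from the report; the remaining values are placeholders, not the exact ones used in this environment):

      # KubeVirt-platform hosted cluster (worker nodes run as VMs on the management cluster).
      % hcp create cluster kubevirt \
          --name hcp415-bm3-a \
          --release-image quay.io/openshift-release-dev/ocp-release:4.15.13-x86_64 \
          --node-pool-replicas 3 \
          --cores 8 \
          --memory 16Gi \
          --pull-secret /path/to/pull-secret.json

      # Agent-platform hosted cluster (worker nodes come from a pre-created agent namespace).
      % hcp create cluster agent \
          --name bm1-ag1 \
          --release-image quay.io/openshift-release-dev/ocp-release:4.15.13-x86_64 \
          --agent-namespace bm1-ag1-agents \
          --base-domain example.com \
          --pull-secret /path/to/pull-secret.json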
      

      Actual results:

      Upgrade did not complete due to issues during node drain

      Expected results:

      Upgrade should complete successfully

      Additional info:

      Node drain bug on a similar cluster configuration (the pods causing the issue are different): https://issues.redhat.com/browse/OCPBUGS-34543
      
      

       

              Assignee: rhn-engineering-dvossel (David Vossel)
              Reporter: jijoy@redhat.com (Jilju Joy)
              QA Contact: Liangquan Li