-
Bug
-
Resolution: Unresolved
-
Undefined
-
None
-
4.16.0
-
None
-
Moderate
-
Yes
-
False
-
Description of problem:
Upgrade from OCP 4.16.0-ec.6 to 4.17.0-0.nightly-2024-06-07-045541 did not complete because node drain on a worker node failed due to the error given below. % oc logs -c machine-config-controller -n openshift-machine-config-operator machine-config-controller-55f757ff9f-vxptz --tail 50 E0612 06:34:20.593438 1 drain_controller.go:152] error when evicting pods/"openshift-apiserver-75d6669f47-ws7rh" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. E0612 06:34:20.593514 1 drain_controller.go:152] error when evicting pods/"etcd-0" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. E0612 06:34:20.593529 1 drain_controller.go:152] error when evicting pods/"kube-apiserver-758f4c5fd9-gkvcq" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. E0612 06:34:20.593529 1 drain_controller.go:152] error when evicting pods/"openshift-oauth-apiserver-67bcbc4d6-sgqrk" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. E0612 06:34:20.593972 1 drain_controller.go:152] error when evicting pods/"kube-apiserver-7d6894d9bf-ct8pn" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. E0612 06:34:20.594014 1 drain_controller.go:152] error when evicting pods/"openshift-apiserver-7459fd76f5-ccwr4" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. E0612 06:34:20.594040 1 drain_controller.go:152] error when evicting pods/"oauth-openshift-6c87cd45-v6rz4" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. E0612 06:34:20.594067 1 drain_controller.go:152] error when evicting pods/"oauth-openshift-fbdc777b4-sndxs" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. E0612 06:34:20.596459 1 drain_controller.go:152] error when evicting pods/"openshift-oauth-apiserver-59d46c946c-pntsx" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. E0612 06:34:20.597398 1 drain_controller.go:152] error when evicting pods/"etcd-0" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. 
I0612 06:34:25.594426 1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/oauth-openshift-fbdc777b4-sndxs
I0612 06:34:25.594457 1 drain_controller.go:152] evicting pod bm1-ag1-bm1-ag1/etcd-0
I0612 06:34:25.594467 1 drain_controller.go:152] evicting pod bm1-ag1-bm1-ag1/kube-apiserver-7d6894d9bf-ct8pn
I0612 06:34:25.594484 1 drain_controller.go:152] evicting pod bm1-ag1-bm1-ag1/oauth-openshift-6c87cd45-v6rz4
I0612 06:34:25.594486 1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/kube-apiserver-758f4c5fd9-gkvcq
I0612 06:34:25.594456 1 drain_controller.go:152] evicting pod bm1-ag1-bm1-ag1/openshift-oauth-apiserver-67bcbc4d6-sgqrk
I0612 06:34:25.594472 1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/openshift-apiserver-7459fd76f5-ccwr4
I0612 06:34:25.594513 1 drain_controller.go:152] evicting pod bm1-ag1-bm1-ag1/openshift-apiserver-75d6669f47-ws7rh
I0612 06:34:25.596735 1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/openshift-oauth-apiserver-59d46c946c-pntsx
I0612 06:34:25.597824 1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/etcd-0
E0612 06:34:25.607568 1 drain_controller.go:152] error when evicting pods/"openshift-apiserver-7459fd76f5-ccwr4" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E0612 06:34:25.607611 1 drain_controller.go:152] error when evicting pods/"kube-apiserver-758f4c5fd9-gkvcq" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E0612 06:34:25.608161 1 drain_controller.go:152] error when evicting pods/"openshift-apiserver-75d6669f47-ws7rh" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E0612 06:34:25.608177 1 drain_controller.go:152] error when evicting pods/"openshift-oauth-apiserver-59d46c946c-pntsx" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E0612 06:34:25.608212 1 drain_controller.go:152] error when evicting pods/"kube-apiserver-7d6894d9bf-ct8pn" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E0612 06:34:25.608217 1 drain_controller.go:152] error when evicting pods/"oauth-openshift-6c87cd45-v6rz4" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E0612 06:34:25.608229 1 drain_controller.go:152] error when evicting pods/"openshift-oauth-apiserver-67bcbc4d6-sgqrk" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E0612 06:34:25.608962 1 drain_controller.go:152] error when evicting pods/"oauth-openshift-fbdc777b4-sndxs" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E0612 06:34:25.609089 1 drain_controller.go:152] error when evicting pods/"etcd-0" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E0612 06:34:25.609349 1 drain_controller.go:152] error when evicting pods/"etcd-0" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0612 06:34:30.607984 1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/kube-apiserver-758f4c5fd9-gkvcq
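The repeated "Cannot evict pod as it would violate the pod's disruption budget" messages mean the evictions are being rejected by PodDisruptionBudgets protecting these pods in the hosted control plane namespaces named in the log. For reference, those PDBs can be listed with standard commands (diagnostic sketch only; this output was not captured at the time of the failure):

% oc get pdb -n bm1-ag1-bm1-ag1
% oc get pdb -n clusters-hcp415-bm3-a
% oc get pdb -n clusters-hcp415-bm3-a -o yaml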
% oc get nodes
NAME            STATUS                     ROLES                         AGE   VERSION
baremetal1-01   Ready                      control-plane,master,worker   26d   v1.30.1+9798e19
baremetal1-02   Ready                      control-plane,master,worker   26d   v1.30.1+9798e19
baremetal1-03   Ready                      control-plane,master,worker   26d   v1.30.1+9798e19
baremetal1-04   Ready                      worker                        26d   v1.29.4+d9d4530
baremetal1-05   Ready                      worker                        26d   v1.29.4+d9d4530
baremetal1-06   Ready,SchedulingDisabled   worker                        26d   v1.29.4+d9d4530

These pods are associated with hosted clusters. Hosted cluster "hcp415-bm3-a" uses kubevirt and "bm1-ag1" uses the agent provider.

% oc get hostedcluster -A
NAMESPACE   NAME           VERSION       KUBECONFIG                      PROGRESS    AVAILABLE   PROGRESSING   MESSAGE
bm1-ag1     bm1-ag1        4.16.0-ec.6   bm1-ag1-admin-kubeconfig        Completed   True        False         The hosted control plane is available
bm1-ag2     bm1-ag2        4.16.0-ec.6   bm1-ag2-admin-kubeconfig        Completed   True        False         The hosted control plane is available
clusters    hcp415-bm1-b   4.15.13       hcp415-bm1-b-admin-kubeconfig   Completed   True        False         The hosted control plane is available
clusters    hcp415-bm3-a   4.15.13       hcp415-bm3-a-admin-kubeconfig   Completed   True        False         The hosted control plane is available

This is part of testing OpenShift Data Foundation in Provider Mode configuration. The upgrade command was:

% oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-07-045541 --allow-explicit-upgrade --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-07-045541

At the time of this error, the master nodes had already been drained and upgraded. Drain of the node "baremetal1-06" and of a master node (which is also a worker) started at the same time, but because of this issue the drain of "baremetal1-06" did not complete. Earlier, one master node also had trouble evicting virt-launcher pods (e.g. virt-launcher-hcp415-bm3-a) due to lack of memory; that was resolved by decreasing the nodepool replicas on the kubevirt-based hosted clusters.

Workaround tried: manually delete the pods that did not evict (a sketch of the deletion commands is included at the end of this description). This completed the node drain. The same workaround was applied during the drain of the other two worker nodes. Although this resolved the problem, the impact of deleting the pods manually was not verified during the process. The hosted clusters stay connected after the upgrade.

Must-gather logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/upgrade-drain-failure-must-gather/ (Red Hat network required). Adding the link because the option to browse files for attachments is not available to me. The link contains must-gather logs collected using the Virtualization and ACM must-gather images in addition to the origin must-gather.
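For reference, a sketch of the manual-deletion workaround, using pod names taken from the drain log above (the exact pods differ per node and per attempt). Unlike eviction, `oc delete pod` is not subject to the PodDisruptionBudget check, which is why the drain can proceed afterwards:

% oc delete pod etcd-0 -n bm1-ag1-bm1-ag1
% oc delete pod kube-apiserver-758f4c5fd9-gkvcq -n clusters-hcp415-bm3-a
% oc get nodes    # the drained node should eventually leave Ready,SchedulingDisabled once MCO finishes updating it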
Version-Release number of selected component (if applicable):
OCP 4.16.0-ec.6 (during upgrade to 4.17.0-0.nightly-2024-06-07-045541)
OpenShift Virtualization 4.15.1 (the 4.15 version is installed because of other issues when setting up the cluster with the unreleased 4.16 version)
ACM 2.10.3
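For completeness, the versions above can be cross-checked on the managing cluster with standard commands (default operator namespaces assumed here):

% oc get clusterversion
% oc get csv -n openshift-cnv
% oc get csv -n open-cluster-management
% oc get csv -n multicluster-engine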
How reproducible:
Reporting the first occurrence.
Steps to Reproduce:
1. Prepare a cluster with this configuration: managing cluster OCP 4.16.0-ec.6, CNV 4.15.1, metallb-operator.v4.16.0-202405161711, MCE 2.5.3, advanced-cluster-management.v2.10.3, hosted OCP cluster 4.15.13-x86_64 with master nodes schedulable.
2. Create hosted clusters using kubevirt and agent.
3. Upgrade the cluster to a 4.17.0 nightly build (progress can be watched with the commands sketched below).
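A minimal sketch of how the upgrade and drain progress in step 3 can be watched (standard commands only):

% oc get clusterversion
% oc get co
% oc get mcp
% oc get nodes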
Actual results:
The upgrade did not complete because the drain of a worker node was blocked: pod evictions were rejected due to PodDisruptionBudget violations.
Expected results:
Upgrade should complete successfully
Additional info:
Node drain bug on a similar cluster configuration (the pods causing the issue are different): https://issues.redhat.com/browse/OCPBUGS-34543