OCPBUGS-35336: Node drain during upgrade failed when hosted clusters are present


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version: 4.16.0
    • Severity: Moderate
      Description of problem:

      The upgrade from OCP 4.16.0-ec.6 to 4.17.0-0.nightly-2024-06-07-045541 did not complete because the node drain on a worker node failed with the errors shown below.
      
      % oc logs -c machine-config-controller -n openshift-machine-config-operator machine-config-controller-55f757ff9f-vxptz --tail 50
      E0612 06:34:20.593438       1 drain_controller.go:152] error when evicting pods/"openshift-apiserver-75d6669f47-ws7rh" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.593514       1 drain_controller.go:152] error when evicting pods/"etcd-0" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.593529       1 drain_controller.go:152] error when evicting pods/"kube-apiserver-758f4c5fd9-gkvcq" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.593529       1 drain_controller.go:152] error when evicting pods/"openshift-oauth-apiserver-67bcbc4d6-sgqrk" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.593972       1 drain_controller.go:152] error when evicting pods/"kube-apiserver-7d6894d9bf-ct8pn" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.594014       1 drain_controller.go:152] error when evicting pods/"openshift-apiserver-7459fd76f5-ccwr4" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.594040       1 drain_controller.go:152] error when evicting pods/"oauth-openshift-6c87cd45-v6rz4" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.594067       1 drain_controller.go:152] error when evicting pods/"oauth-openshift-fbdc777b4-sndxs" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.596459       1 drain_controller.go:152] error when evicting pods/"openshift-oauth-apiserver-59d46c946c-pntsx" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:20.597398       1 drain_controller.go:152] error when evicting pods/"etcd-0" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0612 06:34:25.594426       1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/oauth-openshift-fbdc777b4-sndxs
      I0612 06:34:25.594457       1 drain_controller.go:152] evicting pod bm1-ag1-bm1-ag1/etcd-0
      I0612 06:34:25.594467       1 drain_controller.go:152] evicting pod bm1-ag1-bm1-ag1/kube-apiserver-7d6894d9bf-ct8pn
      I0612 06:34:25.594484       1 drain_controller.go:152] evicting pod bm1-ag1-bm1-ag1/oauth-openshift-6c87cd45-v6rz4
      I0612 06:34:25.594486       1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/kube-apiserver-758f4c5fd9-gkvcq
      I0612 06:34:25.594456       1 drain_controller.go:152] evicting pod bm1-ag1-bm1-ag1/openshift-oauth-apiserver-67bcbc4d6-sgqrk
      I0612 06:34:25.594472       1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/openshift-apiserver-7459fd76f5-ccwr4
      I0612 06:34:25.594513       1 drain_controller.go:152] evicting pod bm1-ag1-bm1-ag1/openshift-apiserver-75d6669f47-ws7rh
      I0612 06:34:25.596735       1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/openshift-oauth-apiserver-59d46c946c-pntsx
      I0612 06:34:25.597824       1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/etcd-0
      E0612 06:34:25.607568       1 drain_controller.go:152] error when evicting pods/"openshift-apiserver-7459fd76f5-ccwr4" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.607611       1 drain_controller.go:152] error when evicting pods/"kube-apiserver-758f4c5fd9-gkvcq" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.608161       1 drain_controller.go:152] error when evicting pods/"openshift-apiserver-75d6669f47-ws7rh" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.608177       1 drain_controller.go:152] error when evicting pods/"openshift-oauth-apiserver-59d46c946c-pntsx" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.608212       1 drain_controller.go:152] error when evicting pods/"kube-apiserver-7d6894d9bf-ct8pn" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.608217       1 drain_controller.go:152] error when evicting pods/"oauth-openshift-6c87cd45-v6rz4" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.608229       1 drain_controller.go:152] error when evicting pods/"openshift-oauth-apiserver-67bcbc4d6-sgqrk" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.608962       1 drain_controller.go:152] error when evicting pods/"oauth-openshift-fbdc777b4-sndxs" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.609089       1 drain_controller.go:152] error when evicting pods/"etcd-0" -n "bm1-ag1-bm1-ag1" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      E0612 06:34:25.609349       1 drain_controller.go:152] error when evicting pods/"etcd-0" -n "clusters-hcp415-bm3-a" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      I0612 06:34:30.607984       1 drain_controller.go:152] evicting pod clusters-hcp415-bm3-a/kube-apiserver-758f4c5fd9-gkvcq
      
      
      % oc get nodes
      NAME            STATUS                     ROLES                         AGE   VERSION
      baremetal1-01   Ready                      control-plane,master,worker   26d   v1.30.1+9798e19
      baremetal1-02   Ready                      control-plane,master,worker   26d   v1.30.1+9798e19
      baremetal1-03   Ready                      control-plane,master,worker   26d   v1.30.1+9798e19
      baremetal1-04   Ready                      worker                        26d   v1.29.4+d9d4530
      baremetal1-05   Ready                      worker                        26d   v1.29.4+d9d4530
      baremetal1-06   Ready,SchedulingDisabled   worker                        26d   v1.29.4+d9d4530
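      
      Node "baremetal1-06" is the node whose drain is stuck. To confirm which pods were still running on it (and therefore blocking the drain), a listing such as the following can be used; this command was not captured in the original report and the node name is taken from the output above.
      
      % oc get pods -A --field-selector spec.nodeName=baremetal1-06 -o wide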
      
      
      These pods belong to hosted control planes. Hosted cluster "hcp415-bm3-a" uses the KubeVirt platform and "bm1-ag1" uses the Agent platform.
      
      % oc get hostedcluster -A
      NAMESPACE   NAME           VERSION       KUBECONFIG                      PROGRESS    AVAILABLE   PROGRESSING   MESSAGE
      bm1-ag1     bm1-ag1        4.16.0-ec.6   bm1-ag1-admin-kubeconfig        Completed   True        False         The hosted control plane is available
      bm1-ag2     bm1-ag2        4.16.0-ec.6   bm1-ag2-admin-kubeconfig        Completed   True        False         The hosted control plane is available
      clusters    hcp415-bm1-b   4.15.13       hcp415-bm1-b-admin-kubeconfig   Completed   True        False         The hosted control plane is available
      clusters    hcp415-bm3-a   4.15.13       hcp415-bm3-a-admin-kubeconfig   Completed   True        False         The hosted control plane is available
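      
      The eviction failures indicate that the PodDisruptionBudgets for the hosted control plane components do not allow any further disruptions. Commands along these lines (not captured in the original report; namespaces taken from the log above) would show the PDBs involved and their allowed disruptions:
      
      % oc get pdb -n clusters-hcp415-bm3-a
      % oc get pdb -n bm1-ag1-bm1-ag1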
      
      
      This was observed while testing OpenShift Data Foundation in the Provider Mode configuration.
      
      The upgrade command used was:
      % oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-07-045541 --allow-explicit-upgrade --force
      warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
      warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
      warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
      Requested update to release image registry.ci.openshift.org/ocp/release:4.17.0-0.nightly-2024-06-07-045541
      
      
      
      At the time of this error, the master nodes had already been drained and upgraded. The drain of node "baremetal1-06" and of a master node (which is also a worker) started at the same time, but because of this issue the drain of node "baremetal1-06" did not complete. Earlier, one master node also had problems evicting virt-launcher pods (e.g. virt-launcher-hcp415-bm3-a) due to lack of memory; this was resolved by decreasing the NodePool replicas on the KubeVirt-based hosted clusters.
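      
      For reference, scaling down a NodePool can be done with commands along these lines; the NodePool name and replica count below are illustrative and may not match the exact values used:
      
      % oc get nodepool -n clusters
      % oc scale nodepool/hcp415-bm3-a -n clusters --replicas=2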
      
      
      Workaround tried:
      Manually deleted the pods that could not be evicted, which allowed the node drain to complete. The same workaround was applied during the drain of the other two worker nodes. Although this resolved the problem, the impact of deleting the pods manually was not verified during the process. The hosted clusters stayed connected after the upgrade.
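      
      The exact deletion commands are not recorded here; the workaround was essentially of the following form, repeated for each pod stuck in the eviction retry loop (pod names taken from the drain controller log above):
      
      % oc delete pod kube-apiserver-758f4c5fd9-gkvcq -n clusters-hcp415-bm3-a
      % oc delete pod etcd-0 -n clusters-hcp415-bm3-a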
      
      
      Must-gather logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/upgrade-drain-failure-must-gather/ (Red Hat network required). The link is provided because the option to browse files for attachment is not available to me.
      This location contains must-gather logs collected using the Virtualization and ACM must-gather images in addition to the origin must-gather.
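      
      The additional must-gathers were collected with invocations of roughly this form; the image pull specs are placeholders, as the exact images used are not recorded in this report:
      
      % oc adm must-gather \
          --image-stream=openshift/must-gather \
          --image=<virtualization-must-gather-image> \
          --image=<acm-must-gather-image>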

       

       

      Version-Release number of selected component (if applicable):

      OCP 4.16.0-ec.6 (during upgrade to 4.17.0-0.nightly-2024-06-07-045541)
      OpenShift Virtualization 4.15.1 (the 4.15 version was installed due to other issues with setting up the cluster using the unreleased 4.16 version)
      ACM 2.10.3

      How reproducible:

      Reporting the first occurrence.

      Steps to Reproduce:

      1. Prepare a cluster with the following configuration: management cluster OCP 4.16.0-ec.6, CNV 4.15.1, metallb-operator.v4.16.0-202405161711, MCE 2.5.3, advanced-cluster-management.v2.10.3, hosted OCP cluster 4.15.13-x86_64, with master nodes schedulable.
      2. Create hosted clusters using the KubeVirt and Agent platforms (see the example hcp commands after these steps).
      3. Upgrade the management cluster to a 4.17.0 nightly build.
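      
      The hosted clusters in step 2 can be created with the hcp CLI; the commands below are a rough sketch with placeholder paths, domains, and sizing, using the cluster names and release version observed in this environment:
      
      % hcp create cluster kubevirt \
          --name hcp415-bm3-a \
          --node-pool-replicas 3 \
          --memory 8Gi \
          --cores 4 \
          --pull-secret /path/to/pull-secret.json \
          --release-image quay.io/openshift-release-dev/ocp-release:4.15.13-x86_64
      
      % hcp create cluster agent \
          --name bm1-ag1 \
          --namespace bm1-ag1 \
          --agent-namespace bm1-ag1-agents \
          --base-domain example.com \
          --pull-secret /path/to/pull-secret.json \
          --release-image quay.io/openshift-release-dev/ocp-release:4.15.13-x86_64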
      

      Actual results:

      The upgrade did not complete due to the node drain failure described above.

      Expected results:

      The upgrade should complete successfully.

      Additional info:

      A node drain bug on a similar cluster configuration (the pods causing the issue are different): https://issues.redhat.com/browse/OCPBUGS-34543
      
      

       

            Assignee: David Vossel (rhn-engineering-dvossel)
            Reporter: Jilju Joy (jijoy@redhat.com)
            QA Contact: Liangquan Li