Bug
Resolution: Unresolved
Major
Quality / Stability / Reliability
Description of problem:

During 100 node / 10K VM high scale testing, after multiple rounds of node drain + descheduler testing, one node has been stuck draining on a single VM for many hours:

oc logs machine-config-controller-5b5dcb74d4-tldmz -n openshift-machine-config-operator | grep d22-h02-000-r650
I0825 13:38:38.638362 1 drain_controller.go:193] node d22-h02-000-r650: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when evicting pods/"virt-launcher-rhel9-4050-rc7x5" -n "vm-ns-41": global timeout reached: 1m30s

The blocking VM was previously migrated onto that node "Successfully", but the source pod never went to Completed and is now showing NotReady. It is not clear whether that is the reason a new VMIM cannot be processed:

virt-launcher-rhel9-4050-7f4p8   1/2   NotReady   0   3d18h   10.129.36.78    d41-h17-000-r660   <none>   1/1
virt-launcher-rhel9-4050-rc7x5   2/2   Running    0   3d6h    10.130.18.145   d22-h02-000-r650   <none>   1/1

Log from the stale source pod virt-launcher-rhel9-4050-7f4p8:

{"component":"virt-launcher","kind":"","level":"warning","msg":"failed to get domain job info, will retry","name":"rhel9-4050","namespace":"vm-ns-41","pos":"live-migration-source.go:661","reason":"virError(Code=42, Domain=10, Message='Domain not found: no domain with matching uuid '72e68b80-01fa-5153-bb1d-894198f47279' (vm-ns-41_rhel9-4050)')","timestamp":"2025-08-22T07:43:02.732666Z","uid":"a5211a6a-f724-4692-83e5-2a38c78cec07"}
{"component":"virt-launcher","level":"info","msg":"Process vm-ns-41_rhel9-4050 and pid 65 is gone!","pos":"monitor.go:179","timestamp":"2025-08-22T07:43:02.732775Z"}
{"component":"virt-launcher","level":"info","msg":"Waiting on final notifications to be sent to virt-handler.","pos":"virt-launcher.go:282","timestamp":"2025-08-22T07:43:02.732812Z"}
{"component":"virt-launcher","level":"info","msg":"Final Delete notification sent","pos":"virt-launcher.go:297","timestamp":"2025-08-22T07:43:02.732829Z"}
{"component":"virt-launcher","kind":"","level":"warning","msg":"failed to get domain job info, will retry","name":"rhel9-4050","namespace":"vm-ns-41","pos":"live-migration-source.go:661","reason":"virError(Code=42, Domain=10, Message='Domain not found: no domain with matching uuid '72e68b80-01fa-5153-bb1d-894198f47279' (vm-ns-41_rhel9-4050)')","timestamp":"2025-08-22T07:43:02.732791Z","uid":"a5211a6a-f724-4692-83e5-2a38c78cec07"}
{"component":"virt-launcher","level":"info","msg":"stopping cmd server","pos":"server.go:625","timestamp":"2025-08-22T07:43:02.732882Z"}
{"component":"virt-launcher","level":"info","msg":"cmd server stopped","pos":"server.go:634","timestamp":"2025-08-22T07:43:02.733086Z"}
{"component":"virt-launcher","level":"info","msg":"Exiting...","pos":"virt-launcher.go:513","timestamp":"2025-08-22T07:43:02.733123Z"}
{"component":"virt-launcher-monitor","level":"info","msg":"Reaped pid 25 with status 9","pos":"virt-launcher-monitor.go:202","timestamp":"2025-08-22T07:43:02.738130Z"}

All VMIMs for this VM show Succeeded:

[root@e44-h32-000-r650 ~]# oc get vmim -n vm-ns-41 | grep rhel9-4050
kubevirt-evacuation-gg5h8   Succeeded   rhel9-4050
kubevirt-evacuation-skbbd   Succeeded   rhel9-4050
kubevirt-evacuation-tw282   Succeeded   rhel9-4050

Looking at the oldest VMIM, the migration that landed this VM on d22-h02-000-r650 (the node that is now stuck evicting it), no issues are reported:

status:
  migrationState:
    completed: true
    endTimestamp: "2025-08-22T07:43:02Z"
    [...]
    sourceNode: d41-h17-000-r660
    sourcePod: virt-launcher-rhel9-4050-7f4p8
    startTimestamp: "2025-08-22T07:43:00Z"
    targetDirectMigrationNodePorts:
      "36237": 49152
      "44693": 0
    targetNode: d22-h02-000-r650
    targetNodeAddress: 10.130.18.127
    targetNodeDomainDetected: true
    targetNodeDomainReadyTimestamp: "2025-08-22T07:43:02Z"
    targetPod: virt-launcher-rhel9-4050-rc7x5
  phase: Succeeded
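For whoever picks this up, a minimal sketch of additional state worth collecting. Resource names are taken from the output above; the suggestion to look at the per-VMI PodDisruptionBudget is an assumption to verify, not a confirmed root cause.

# Why the stale source pod is NotReady
oc describe pod virt-launcher-rhel9-4050-7f4p8 -n vm-ns-41

# VMI status and conditions, including which launcher pod it considers active
oc get vmi rhel9-4050 -n vm-ns-41 -o yaml

# PodDisruptionBudgets in the namespace; KubeVirt keeps one per live-migratable VMI,
# and a NotReady source pod still matched by its selector could hold it at its
# disruption limit (assumption to verify)
oc get pdb -n vm-ns-41 -o wide

# Current drain status as seen by the machine-config-controller
oc -n openshift-machine-config-operator logs deploy/machine-config-controller -c machine-config-controller | grep d22-h02-000-r650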
Version-Release number of selected component (if applicable):
OCP 4.19.3
kubevirt-hyperconverged-operator.v4.19.3
How reproducible:
This condition has been hit only once so far.
Steps to Reproduce:
1. Run 10K VMs across 100 namespaces, enable the descheduler profile, and drain all nodes (this did not reproduce in previous drain rounds, so it is likely a corner case); a rough sketch of the drain step is shown below.
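The drains in this test run are reported by the machine-config-controller (see the description), so the loop below is only an illustrative manual equivalent of the drain step; the node selector, flags, and timeout are assumptions, not the scale test's actual tooling.

for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
  # evict all evictable pods, including virt-launcher pods, then return the node to service
  oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
  oc adm uncordon "$node"
done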
Actual results:
The node cannot finish draining; eviction of virt-launcher-rhel9-4050-rc7x5 keeps hitting the 1m30s global timeout and the machine-config-controller retries indefinitely.
Expected results:
The VM is evicted (live migrated off the node) and the node drain completes.