Type: Bug
Resolution: Unresolved
Priority: Major
Component: Quality / Stability / Reliability
Description of problem:
During 100-node / 10K-VM high-scale testing, after multiple rounds of node drain + descheduler testing, one node has been stuck draining on a single VM for many hours:
oc logs machine-config-controller-5b5dcb74d4-tldmz -n openshift-machine-config-operator | grep d22-h02-000-r650
I0825 13:38:38.638362 1 drain_controller.go:193] node d22-h02-000-r650: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when evicting pods/"virt-launcher-rhel9-4050-rc7x5" -n "vm-ns-41": global timeout reached: 1m30s
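For additional context on the eviction failure from the API side, something like the following could be checked (a sketch; the node and pod names come from the drain error above, and the machine-config node-state annotation is an assumption about what is useful to look at):
# Events around the failing eviction in the VM namespace:
oc get events -n vm-ns-41 --field-selector involvedObject.name=virt-launcher-rhel9-4050-rc7x5 --sort-by=.lastTimestamp
# Machine-config drain/update state recorded on the stuck node:
oc get node d22-h02-000-r650 -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}'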
The blocking VM was migrated to that node "Successfully", but the source pod never went to Completed and is now showing NotReady; it is unclear whether that is the reason a new VMIM cannot be processed:
NAME                             READY   STATUS     RESTARTS   AGE     IP              NODE               NOMINATED NODE   READINESS GATES
virt-launcher-rhel9-4050-7f4p8   1/2     NotReady   0          3d18h   10.129.36.78    d41-h17-000-r660   <none>           1/1
virt-launcher-rhel9-4050-rc7x5   2/2     Running    0          3d6h    10.130.18.145   d22-h02-000-r650   <none>           1/1
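A sketch of further checks on whether the stale source pod is what blocks a new evacuation VMIM; the namespace, pod, and VMI names come from the output above, while the fields queried (containerStatuses, activePods, the per-VMI PodDisruptionBudget) are standard Kubernetes/KubeVirt mechanisms assumed to apply here:
# Phase and container states of the stale source pod (hypothesis: it never completed after the migration finished):
oc get pod virt-launcher-rhel9-4050-7f4p8 -n vm-ns-41 -o jsonpath='{.status.phase}{"\n"}{range .status.containerStatuses[*]}{.name}{" ready="}{.ready}{"\n"}{end}'
# Which pods the VMI still considers active, and its recorded migration state:
oc get vmi rhel9-4050 -n vm-ns-41 -o jsonpath='{.status.activePods}{"\n"}{.status.migrationState.completed}{"\n"}'
# KubeVirt keeps a PodDisruptionBudget per live-migratable VMI; ALLOWED DISRUPTIONS 0 here would explain the eviction timing out:
oc get pdb -n vm-ns-41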
The log of that stale source pod, virt-launcher-rhel9-4050-7f4p8:
{"component":"virt-launcher","kind":"","level":"warning","msg":"failed to get domain job info, will retry","name":"rhel9-4050","namespace":"vm-ns-41","pos":"live-migration-source.go:661","reason":"virError(Code=42, Domain=10, Message='Domain not found: no domain with matching uuid '72e68b80-01fa-5153-bb1d-894198f47279' (vm-ns-41_rhel9-4050)')","timestamp":"2025-08-22T07:43:02.732666Z","uid":"a5211a6a-f724-4692-83e5-2a38c78cec07"}
{"component":"virt-launcher","level":"info","msg":"Process vm-ns-41_rhel9-4050 and pid 65 is gone!","pos":"monitor.go:179","timestamp":"2025-08-22T07:43:02.732775Z"}
{"component":"virt-launcher","level":"info","msg":"Waiting on final notifications to be sent to virt-handler.","pos":"virt-launcher.go:282","timestamp":"2025-08-22T07:43:02.732812Z"}
{"component":"virt-launcher","level":"info","msg":"Final Delete notification sent","pos":"virt-launcher.go:297","timestamp":"2025-08-22T07:43:02.732829Z"}
{"component":"virt-launcher","kind":"","level":"warning","msg":"failed to get domain job info, will retry","name":"rhel9-4050","namespace":"vm-ns-41","pos":"live-migration-source.go:661","reason":"virError(Code=42, Domain=10, Message='Domain not found: no domain with matching uuid '72e68b80-01fa-5153-bb1d-894198f47279' (vm-ns-41_rhel9-4050)')","timestamp":"2025-08-22T07:43:02.732791Z","uid":"a5211a6a-f724-4692-83e5-2a38c78cec07"}
{"component":"virt-launcher","level":"info","msg":"stopping cmd server","pos":"server.go:625","timestamp":"2025-08-22T07:43:02.732882Z"}
{"component":"virt-launcher","level":"info","msg":"cmd server stopped","pos":"server.go:634","timestamp":"2025-08-22T07:43:02.733086Z"}
{"component":"virt-launcher","level":"info","msg":"Exiting...","pos":"virt-launcher.go:513","timestamp":"2025-08-22T07:43:02.733123Z"}
{"component":"virt-launcher-monitor","level":"info","msg":"Reaped pid 25 with status 9","pos":"virt-launcher-monitor.go:202","timestamp":"2025-08-22T07:43:02.738130Z"}
All VMIMs for this VM show Succeeded:
[root@e44-h32-000-r650 ~]# oc get vmim -n vm-ns-41 | grep rhel9-4050
kubevirt-evacuation-gg5h8 Succeeded rhel9-4050
kubevirt-evacuation-skbbd Succeeded rhel9-4050
kubevirt-evacuation-tw282 Succeeded rhel9-4050
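To confirm which of the three is the oldest, sorting by creation time should work (the custom-columns layout here is just a convenience, not taken from the report):
oc get vmim -n vm-ns-41 --sort-by=.metadata.creationTimestamp -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,CREATED:.metadata.creationTimestamp | grep rhel9-4050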
Looking at the oldest VMIM, the one whose migration landed the VM on d22-h02-000-r650 (the node that is now stuck evicting it), no issues are reported:
status:
  migrationState:
    completed: true
    endTimestamp: "2025-08-22T07:43:02Z"
    [...]
    sourceNode: d41-h17-000-r660
    sourcePod: virt-launcher-rhel9-4050-7f4p8
    startTimestamp: "2025-08-22T07:43:00Z"
    targetDirectMigrationNodePorts:
      "36237": 49152
      "44693": 0
    targetNode: d22-h02-000-r650
    targetNodeAddress: 10.130.18.127
    targetNodeDomainDetected: true
    targetNodeDomainReadyTimestamp: "2025-08-22T07:43:02Z"
    targetPod: virt-launcher-rhel9-4050-rc7x5
  phase: Succeeded
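So the last migration completed cleanly and the target domain was detected as ready, yet no new evacuation VMIM appears for the current drain. A sketch of checks on the evacuation path itself (evacuationNodeName is the VMI status field KubeVirt's evacuation controller keys on, an assumption based on upstream KubeVirt; the virt-controller deployment name and namespace are the OpenShift Virtualization defaults, also assumed here):
# Is the VMI marked for evacuation off the stuck node?
oc get vmi rhel9-4050 -n vm-ns-41 -o jsonpath='{.status.evacuationNodeName}{"\n"}'
# Does virt-controller report why it is not creating a new evacuation VMIM?
oc logs -n openshift-cnv deploy/virt-controller | grep rhel9-4050 | tail -n 20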
Version-Release number of selected component (if applicable):
OCP 4.19.3 kubevirt-hyperconverged-operator.v4.19.3
How reproducible:
Only hit this condition once so far
Steps to Reproduce:
1. Run 10K VMs across 100 namespaces, enable the descheduler profile, and drain all nodes (this did not reproduce in past drain rounds, so it is likely a corner case)
Actual results:
Node cannot drain
Expected results:
VM can be evicted