Type: Bug
Resolution: Unresolved
Priority: Major
Component: Quality / Stability / Reliability
Description of problem:
During 100-node / 10K-VM high-scale testing, after multiple rounds of node drain + descheduler testing, one node has been stuck draining on a single VM for many hours:
oc logs machine-config-controller-5b5dcb74d4-tldmz -n openshift-machine-config-operator | grep d22-h02-000-r650
I0825 13:38:38.638362 1 drain_controller.go:193] node d22-h02-000-r650: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when evicting pods/"virt-launcher-rhel9-4050-rc7x5" -n "vm-ns-41": global timeout reached: 1m30s
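For additional context on the eviction failure from the API side, something like the following could be checked (a sketch; the node and pod names come from the drain error above, and the machine-config node-state annotation is an assumption about what is useful to look at):
# Events around the failing eviction in the VM namespace:
oc get events -n vm-ns-41 --field-selector involvedObject.name=virt-launcher-rhel9-4050-rc7x5 --sort-by=.lastTimestamp
# Machine-config drain/update state recorded on the stuck node:
oc get node d22-h02-000-r650 -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}'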
The blocking VM was migrated to that node "Successfully", but the source pod never went to Completed and is now showing NotReady; it is unclear whether that is the reason a new VMIM cannot be processed:
NAME                             READY   STATUS     RESTARTS   AGE     IP              NODE               NOMINATED NODE   READINESS GATES
virt-launcher-rhel9-4050-7f4p8   1/2     NotReady   0          3d18h   10.129.36.78    d41-h17-000-r660   <none>           1/1
virt-launcher-rhel9-4050-rc7x5   2/2     Running    0          3d6h    10.130.18.145   d22-h02-000-r650   <none>           1/1
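A sketch of further checks on whether the stale source pod is what blocks a new evacuation VMIM; the namespace, pod, and VMI names come from the output above, while the fields queried (containerStatuses, activePods, the per-VMI PodDisruptionBudget) are standard Kubernetes/KubeVirt mechanisms assumed to apply here:
# Phase and container states of the stale source pod (hypothesis: it never completed after the migration finished):
oc get pod virt-launcher-rhel9-4050-7f4p8 -n vm-ns-41 -o jsonpath='{.status.phase}{"\n"}{range .status.containerStatuses[*]}{.name}{" ready="}{.ready}{"\n"}{end}'
# Which pods the VMI still considers active, and its recorded migration state:
oc get vmi rhel9-4050 -n vm-ns-41 -o jsonpath='{.status.activePods}{"\n"}{.status.migrationState.completed}{"\n"}'
# KubeVirt keeps a PodDisruptionBudget per live-migratable VMI; ALLOWED DISRUPTIONS 0 here would explain the eviction timing out:
oc get pdb -n vm-ns-41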
The log of that stale source pod, virt-launcher-rhel9-4050-7f4p8:
{"component":"virt-launcher","kind":"","level":"warning","msg":"failed to get domain job info, will retry","name":"rhel9-4050","namespace":"vm-ns-41","pos":"live-migration-source.go:661","reason":"virError(Code=42, Domain=10, Message='Domain not found: no domain with matching uuid '72e68b80-01fa-5153-bb1d-894198f47279' (vm-ns-41_rhel9-4050)')","timestamp":"2025-08-22T07:43:02.732666Z","uid":"a5211a6a-f724-4692-83e5-2a38c78cec07"}
{"component":"virt-launcher","level":"info","msg":"Process vm-ns-41_rhel9-4050 and pid 65 is gone!","pos":"monitor.go:179","timestamp":"2025-08-22T07:43:02.732775Z"}
{"component":"virt-launcher","level":"info","msg":"Waiting on final notifications to be sent to virt-handler.","pos":"virt-launcher.go:282","timestamp":"2025-08-22T07:43:02.732812Z"}
{"component":"virt-launcher","level":"info","msg":"Final Delete notification sent","pos":"virt-launcher.go:297","timestamp":"2025-08-22T07:43:02.732829Z"}
{"component":"virt-launcher","kind":"","level":"warning","msg":"failed to get domain job info, will retry","name":"rhel9-4050","namespace":"vm-ns-41","pos":"live-migration-source.go:661","reason":"virError(Code=42, Domain=10, Message='Domain not found: no domain with matching uuid '72e68b80-01fa-5153-bb1d-894198f47279' (vm-ns-41_rhel9-4050)')","timestamp":"2025-08-22T07:43:02.732791Z","uid":"a5211a6a-f724-4692-83e5-2a38c78cec07"}
{"component":"virt-launcher","level":"info","msg":"stopping cmd server","pos":"server.go:625","timestamp":"2025-08-22T07:43:02.732882Z"}
{"component":"virt-launcher","level":"info","msg":"cmd server stopped","pos":"server.go:634","timestamp":"2025-08-22T07:43:02.733086Z"}
{"component":"virt-launcher","level":"info","msg":"Exiting...","pos":"virt-launcher.go:513","timestamp":"2025-08-22T07:43:02.733123Z"}
{"component":"virt-launcher-monitor","level":"info","msg":"Reaped pid 25 with status 9","pos":"virt-launcher-monitor.go:202","timestamp":"2025-08-22T07:43:02.738130Z"}
All VMIMs for this VM show Succeeded:
[root@e44-h32-000-r650 ~]# oc get vmim -n vm-ns-41 | grep rhel9-4050
kubevirt-evacuation-gg5h8 Succeeded rhel9-4050
kubevirt-evacuation-skbbd Succeeded rhel9-4050
kubevirt-evacuation-tw282 Succeeded rhel9-4050
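To confirm which of the three is the oldest, sorting by creation time should work (the custom-columns layout here is just a convenience, not taken from the report):
oc get vmim -n vm-ns-41 --sort-by=.metadata.creationTimestamp -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,CREATED:.metadata.creationTimestamp | grep rhel9-4050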
Looking at the oldest VMIM, the one whose migration landed the VM on d22-h02-000-r650 (the node that is now stuck evicting it), no issues are reported:
status:
  migrationState:
    completed: true
    endTimestamp: "2025-08-22T07:43:02Z"
    [...]
    sourceNode: d41-h17-000-r660
    sourcePod: virt-launcher-rhel9-4050-7f4p8
    startTimestamp: "2025-08-22T07:43:00Z"
    targetDirectMigrationNodePorts:
      "36237": 49152
      "44693": 0
    targetNode: d22-h02-000-r650
    targetNodeAddress: 10.130.18.127
    targetNodeDomainDetected: true
    targetNodeDomainReadyTimestamp: "2025-08-22T07:43:02Z"
    targetPod: virt-launcher-rhel9-4050-rc7x5
  phase: Succeeded
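So the last migration completed cleanly and the target domain was detected as ready, yet no new evacuation VMIM appears for the current drain. A sketch of checks on the evacuation path itself (evacuationNodeName is the VMI status field KubeVirt's evacuation controller keys on, an assumption based on upstream KubeVirt; the virt-controller deployment name and namespace are the OpenShift Virtualization defaults, also assumed here):
# Is the VMI marked for evacuation off the stuck node?
oc get vmi rhel9-4050 -n vm-ns-41 -o jsonpath='{.status.evacuationNodeName}{"\n"}'
# Does virt-controller report why it is not creating a new evacuation VMIM?
oc logs -n openshift-cnv deploy/virt-controller | grep rhel9-4050 | tail -n 20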
Version-Release number of selected component (if applicable):
OCP 4.19.3 kubevirt-hyperconverged-operator.v4.19.3
How reproducible:
Only hit this condition once so far
Steps to Reproduce:
1. Run 10K VMs across 100 namespaces, enable the descheduler profile, and drain all nodes (this did not reproduce in past drain rounds, so it is likely a corner case)
Actual results:
Node cannot drain
Expected results:
VM can be evicted