OpenShift Virtualization / CNV-67948

Node stuck on VM eviction, prior src pod in NotReady state


      Description of problem:

During 100-node / 10K-VM high-scale testing, after multiple rounds of node drain + descheduler testing, one node has been stuck draining on a single VM for many hours:
      
      oc logs machine-config-controller-5b5dcb74d4-tldmz -n openshift-machine-config-operator | grep d22-h02-000-r650
      
      I0825 13:38:38.638362       1 drain_controller.go:193] node d22-h02-000-r650: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: error when evicting pods/"virt-launcher-rhel9-4050-rc7x5" -n "vm-ns-41": global timeout reached: 1m30s
      
The blocking VM was migrated to that node "Successfully", but the source pod never went to Completed; it is now showing NotReady. It is unclear whether that is the reason a new VMIM cannot be processed.
      
      virt-launcher-rhel9-4050-7f4p8   1/2     NotReady    0          3d18h   10.129.36.78    d41-h17-000-r660   <none>           1/1
      virt-launcher-rhel9-4050-rc7x5   2/2     Running     0          3d6h    10.130.18.145   d22-h02-000-r650   <none>           1/1
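
To see which container in the stale source pod is failing readiness, something like the following can be used (illustrative; pod name and namespace taken from the listing above):

```shell
# Per-container readiness and state of the stale source pod
oc get pod virt-launcher-rhel9-4050-7f4p8 -n vm-ns-41 \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\t"}{.state}{"\n"}{end}'

# Full conditions and events view
oc describe pod virt-launcher-rhel9-4050-7f4p8 -n vm-ns-41
```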
      
      
Log from the NotReady source pod virt-launcher-rhel9-4050-7f4p8:
      
      {"component":"virt-launcher","kind":"","level":"warning","msg":"failed to get domain job info, will retry","name":"rhel9-4050","namespace":"vm-ns-41","pos":"live-migration-source.go:661","reason":"virError(Code=42, Domain=10, Message='Domain not found: no domain with matching uuid '72e68b80-01fa-5153-bb1d-894198f47279' (vm-ns-41_rhel9-4050)')","timestamp":"2025-08-22T07:43:02.732666Z","uid":"a5211a6a-f724-4692-83e5-2a38c78cec07"}
      {"component":"virt-launcher","level":"info","msg":"Process vm-ns-41_rhel9-4050 and pid 65 is gone!","pos":"monitor.go:179","timestamp":"2025-08-22T07:43:02.732775Z"}
      {"component":"virt-launcher","level":"info","msg":"Waiting on final notifications to be sent to virt-handler.","pos":"virt-launcher.go:282","timestamp":"2025-08-22T07:43:02.732812Z"}
      {"component":"virt-launcher","level":"info","msg":"Final Delete notification sent","pos":"virt-launcher.go:297","timestamp":"2025-08-22T07:43:02.732829Z"}
      {"component":"virt-launcher","kind":"","level":"warning","msg":"failed to get domain job info, will retry","name":"rhel9-4050","namespace":"vm-ns-41","pos":"live-migration-source.go:661","reason":"virError(Code=42, Domain=10, Message='Domain not found: no domain with matching uuid '72e68b80-01fa-5153-bb1d-894198f47279' (vm-ns-41_rhel9-4050)')","timestamp":"2025-08-22T07:43:02.732791Z","uid":"a5211a6a-f724-4692-83e5-2a38c78cec07"}
      {"component":"virt-launcher","level":"info","msg":"stopping cmd server","pos":"server.go:625","timestamp":"2025-08-22T07:43:02.732882Z"}
      {"component":"virt-launcher","level":"info","msg":"cmd server stopped","pos":"server.go:634","timestamp":"2025-08-22T07:43:02.733086Z"}
      {"component":"virt-launcher","level":"info","msg":"Exiting...","pos":"virt-launcher.go:513","timestamp":"2025-08-22T07:43:02.733123Z"}
      {"component":"virt-launcher-monitor","level":"info","msg":"Reaped pid 25 with status 9","pos":"virt-launcher-monitor.go:202","timestamp":"2025-08-22T07:43:02.738130Z"}
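
At this scale, warnings like the ones above can be triaged in bulk. A minimal sketch (assuming virt-launcher logs are newline-delimited JSON as shown) that groups warning lines by their libvirt error reason:

```python
import json
from collections import Counter

def summarize_warnings(log_text):
    """Count virt-launcher warning entries by their 'reason' field."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines in the stream
        if entry.get("level") == "warning":
            counts[entry.get("reason", "<no reason>")] += 1
    return counts

# Trimmed-down sample lines in the same shape as the log above
sample = "\n".join([
    '{"component":"virt-launcher","level":"warning","msg":"failed to get domain job info, will retry","reason":"virError(Code=42, Domain=10, Message=\'Domain not found\')"}',
    '{"component":"virt-launcher","level":"info","msg":"Exiting..."}',
])
print(summarize_warnings(sample))
```

Feeding it `oc logs` output for all virt-launcher pods would show whether this "Domain not found" pattern is isolated to this one pod or widespread.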
      
All VMIMs for this VM report Succeeded:
      [root@e44-h32-000-r650 ~]# oc get vmim -n vm-ns-41 | grep rhel9-4050
      kubevirt-evacuation-gg5h8   Succeeded   rhel9-4050
      kubevirt-evacuation-skbbd   Succeeded   rhel9-4050
      kubevirt-evacuation-tw282   Succeeded   rhel9-4050
      
Looking at the oldest VMIM, which landed the VM on d22-h02-000-r650 (now the node stuck evicting this VM), no issues are reported:
      
      status:
        migrationState:
          completed: true
          endTimestamp: "2025-08-22T07:43:02Z"
      [...]
          sourceNode: d41-h17-000-r660
          sourcePod: virt-launcher-rhel9-4050-7f4p8
          startTimestamp: "2025-08-22T07:43:00Z"
          targetDirectMigrationNodePorts:
            "36237": 49152
            "44693": 0
          targetNode: d22-h02-000-r650
          targetNodeAddress: 10.130.18.127
          targetNodeDomainDetected: true
          targetNodeDomainReadyTimestamp: "2025-08-22T07:43:02Z"
          targetPod: virt-launcher-rhel9-4050-rc7x5
        phase: Succeeded
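
Since every VMIM reports Succeeded and the source domain is gone per the log above, one possible mitigation to investigate (not a confirmed fix, and only after verifying the source pod is truly orphaned) is removing the stale NotReady pod so the controller can reconcile:

```shell
# Confirm the VMI is now running on the target node before touching anything
oc get vmi rhel9-4050 -n vm-ns-41 -o jsonpath='{.status.nodeName}{"\n"}'

# CAUTION: hypothetical mitigation — delete the stale NotReady source pod
# only if the migration completed and the domain no longer exists there
oc delete pod virt-launcher-rhel9-4050-7f4p8 -n vm-ns-41
```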

      Version-Release number of selected component (if applicable):

      OCP 4.19.3
      kubevirt-hyperconverged-operator.v4.19.3

      How reproducible:

This condition has been hit only once so far.

      Steps to Reproduce:

1. Run 10K VMs across 100 namespaces, enable the descheduler profile, and drain all nodes (this did not reproduce in past drain rounds, so it is likely a corner case)
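
The drain rounds in step 1 can be sketched roughly as follows (illustrative only; label selector and settle logic are assumptions, and the actual test uses the machine-config controller rather than manual drains):

```shell
# One drain round across all worker nodes, descheduler profile already enabled
for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
  oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=90s
  # ... wait here for VM live migrations to settle, then return the node:
  oc adm uncordon "$node"
done
```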
      

      Actual results:

      Node cannot drain

      Expected results:

      VM can be evicted

        lpivarc Luboslav Pivarc
        jhopper@redhat.com Jenifer Abrams
        Denys Shchedrivyi