CNV-56659: VirtualMachineInstanceMigrations failing during upgrade under load



      Description of problem:

      VirtualMachineInstanceMigrations failing during upgrade under load

      Version-Release number of selected component (if applicable):

      4.18.0.rhel9-813

      How reproducible:

      Reproduced multiple times during the same workload update cycle.

      Steps to Reproduce:

      1. Deploy OCP/LSO/ODF/CNV 4.16
      2. Perform EUS upgrade with internal VM load
      3. Upgrade OCP/LSO/ODF/CNV to 4.18
      4. Enable LiveMigrate prior to unpausing the worker MCP (see the sketch after this list)
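      A minimal sketch of step 4 using the Python kubernetes client (assumed defaults: HyperConverged CR kubevirt-hyperconverged in openshift-cnv, MachineConfigPool worker; not the exact commands used during the run):

      # Step 4 sketch: enable LiveMigrate cluster-wide, then unpause the worker MCP.
      from kubernetes import client, config

      config.load_kube_config()
      api = client.CustomObjectsApi()

      # Set the cluster-wide eviction strategy on the HyperConverged CR.
      api.patch_namespaced_custom_object(
          group="hco.kubevirt.io",
          version="v1beta1",
          namespace="openshift-cnv",
          plural="hyperconvergeds",
          name="kubevirt-hyperconverged",
          body={"spec": {"evictionStrategy": "LiveMigrate"}},
      )

      # Unpause the worker MachineConfigPool so the paused node updates resume.
      api.patch_cluster_custom_object(
          group="machineconfiguration.openshift.io",
          version="v1",
          plural="machineconfigpools",
          name="worker",
          body={"spec": {"paused": False}},
      )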

      Actual results:

      After 24 hours, only 436 of the 2000 VMs had migrated successfully; the other 1564 VMs were still running the old virt-launcher.
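      For reference, a rough way to count which virt-launcher image the remaining VMs are on (a sketch with the Python kubernetes client; the kubevirt.io=virt-launcher label and the compute container name are KubeVirt defaults):

      # Group running virt-launcher pods by the image of their "compute" container.
      from collections import Counter
      from kubernetes import client, config

      config.load_kube_config()
      core = client.CoreV1Api()

      pods = core.list_pod_for_all_namespaces(label_selector="kubevirt.io=virt-launcher")
      images = Counter(
          c.image
          for pod in pods.items
          for c in pod.spec.containers
          if c.name == "compute"
      )
      for image, count in images.most_common():
          print(count, image)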
      
      After scaling back CNV and resetting the migrations, the migrations begin to fail again once the combined number of Pending and undefined (blank-phase) migrations approaches 500.
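      One way to tally the migration objects by phase, counting migrations with no status.phase yet as blank (a sketch; not necessarily how the numbers above were gathered):

      # Count VirtualMachineInstanceMigration objects per phase ("<blank>" = no phase set).
      from collections import Counter
      from kubernetes import client, config

      config.load_kube_config()
      api = client.CustomObjectsApi()

      vmims = api.list_cluster_custom_object(
          group="kubevirt.io",
          version="v1",
          plural="virtualmachineinstancemigrations",
      )
      phases = Counter(
          item.get("status", {}).get("phase") or "<blank>" for item in vmims["items"]
      )
      for phase, count in phases.most_common():
          print(count, phase)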
      
      
      Errors seen include:
      
      virt-controller:
      ...
      2025-02-15T14:31:51.482528488Z W0215 14:31:51.482514       1 shared_informer.go:597] resyncPeriod 5m0s is smaller than resyncCheckPeriod 13h27m2.088196035s and the informer has already started. Changing it to 13h27m2.088196035s
      ...
      2025-02-15T14:32:25.439656573Z {"component":"virt-controller","kind":"DataVolume","level":"info","msg":"Looking for DataVolume Ref","name":"vm-instancetype-cirros-test-0000-volume","namespace":"default","pos":"vm.go:2283","timestamp":"2025-02-15T14:32:25.439454Z","uid":"4a56c8c7-a628-473a-870e-3cc5b8da15a2"}
      2025-02-15T14:32:25.439656573Z {"component":"virt-controller","kind":"DataVolume","level":"error","msg":"Cant find the matching VM for DataVolume: vm-instancetype-cirros-test-0000-volume","name":"vm-instancetype-cirros-test-0000-volume","namespace":"default","pos":"vm.go:2294","timestamp":"2025-02-15T14:32:25.439525Z","uid":"4a56c8c7-a628-473a-870e-3cc5b8da15a2"}
      ...
      "Cant find the matching VM for DataVolume..." appears to be repeated for all VMs
       
      The rest of the log is filled with the following error, along with occasional PDB size increase/decrease messages:
      "patching of vmi conditions and activePods failed: the server rejected our request due to an error in our request"
      2025-02-15T16:15:29.970377919Z {"component":"virt-controller","level":"info","msg":"reenqueuing VirtualMachineInstance default/vm-instancetype-cirros-test-1522","pos":"vmi.go:253","reason":"patching of vmi conditions and activePods failed: the server rejected our request due to an error in our request","timestamp":"2025-02-15T16:15:29.970260Z"} 
       
      virt-handler errors seen:
      2025-02-15T17:52:18.234412246Z {"component":"virt-handler","level":"error","msg":"failed to scrape metrics from //pods/77ef52ad-a310-4b0a-89dc-ad28ce92da55/volumes/kubernetes.io~empty-dir/sockets/launcher-sock","pos":"scrapper.go:48","reason":"failed to connect to cmd client socket: context deadline exceeded","timestamp":"2025-02-15T17:52:18.225320Z"}
      ...
      2025-02-15T17:53:30.083204851Z {"component":"virt-handler","kind":"","level":"error","msg":"target migration listener is not up for this vmi","name":"vm-instancetype-cirros-test-1986","namespace":"default","pos":"vm.go:819","timestamp":"2025-02-15T17:53:30.083162Z","uid":"2946fe73-5b4f-40ac-8f74-02a74ae78152"}
      ...
      2025-02-15T16:31:29.977835795Z {"component":"virt-handler","level":"error","msg":"failed to scrape domain stats for VMI default/vm-instancetype-cirros-test-0668","pos":"queue.go:104","reason":"expected 1 value from DomainstatsScraper, got 0","timestamp":"2025-02-15T16:31:29.977796Z"} 
      
      
      

      Expected results:

      All migrations complete successfully.

      Additional info:

      Load within the VMs was very high, with heavy memory and disk activity.
      Reducing the VM CPU load made no difference.
      Increasing the OSD CPUs from 4 to 8 and then to 12 in response to OSDCPULoadHigh had no effect.

       
