OpenShift Virtualization / CNV-68555

VM live migration breaks after upgrading the cluster to 4.18.13


    • Quality / Stability / Reliability
    • CNV Virt-Cluster Sprint 278
    • Important
    • Customer Reported

      Description of problem:

      The issue is observed in clusters whose nodes have Emerald Rapids family CPUs, where the "host-model" reported by libvirtd changed from SapphireRapids to SierraForest after the upgrade.

      The CPU model of the nodes in this cluster is INTEL(R) XEON(R) GOLD 6548Y+, which belongs to the Emerald Rapids family [1]. Libvirt and QEMU do not define an Emerald Rapids CPU model, so the "host-model" reported by libvirtd before the upgrade was SapphireRapids:

      domcapabilities output from one of the virt-launcher pods that was not upgraded:

          <mode name='host-model' supported='yes'>
            <model fallback='forbid'>SapphireRapids</model>
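
      For reference, the same output can be collected from a running virt-launcher pod. The command below is only a sketch; the "compute" container name and the read-only virsh invocation are assumptions based on a typical KubeVirt deployment:

      # oc exec -n <vm-namespace> virt-launcher-<vm-name>-<hash> -c compute -- virsh -r domcapabilities | grep -A 3 'host-model'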

       

      It also reports usable='no' for the SapphireRapids model:

            <model usable='no' vendor='Intel'>SapphireRapids</model>

       

      virt-handler labels each node with the CPU models reported as usable='yes' as well as the model reported as the host-model. So the SapphireRapids label was previously applied to all the nodes because it was returned as the host-model.
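
      A quick way to see which model virt-handler currently treats as the host model is to inspect the node labels. The host-model-cpu.node.kubevirt.io label prefix used below is an assumption based on KubeVirt's node-labeller conventions:

      # oc get node <node-name> -o json | jq -r '.metadata.labels | keys[] | select(startswith("host-model-cpu.node.kubevirt.io/"))'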

      Support for the SierraForest CPU model was added in libvirt 10.0.0-6.19.el9_4 [2]. OpenShift Virtualization 4.18.11 shipped libvirt 10.0.0-6.17.el9_4.x86_64. After upgrading OpenShift Virtualization to 4.18.13, which includes the version with SierraForest support, libvirt began reporting the host-model as SierraForest instead of SapphireRapids on all the nodes:

          <mode name='host-model' supported='yes'>
            <model fallback='forbid'>SierraForest</model>
            <vendor>Intel</vendor>
      ….
      ….
          <mode name='custom' supported='yes'>
      ….
            <model usable='no' vendor='Intel'>SierraForest</model>
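
      The libvirt version actually running on a node can be confirmed from the corresponding virt-launcher pod. This is a sketch that assumes the compute container image carries the libvirt RPMs:

      # oc exec -n <vm-namespace> virt-launcher-<vm-name>-<hash> -c compute -- rpm -qa | grep libvirt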

       

      As a result, the nodes were no longer labeled with SapphireRapids after the upgrade:

      # oc get node <node-name> -o yaml |grep " cpu-model-migration"
          cpu-model-migration.node.kubevirt.io/Broadwell-noTSX: 'true'
          cpu-model-migration.node.kubevirt.io/Broadwell-noTSX-IBRS: 'true'
          cpu-model-migration.node.kubevirt.io/Cascadelake-Server-noTSX: 'true'
          cpu-model-migration.node.kubevirt.io/Haswell-noTSX: 'true'
          cpu-model-migration.node.kubevirt.io/Haswell-noTSX-IBRS: 'true'
          cpu-model-migration.node.kubevirt.io/Icelake-Server-noTSX: 'true'
          cpu-model-migration.node.kubevirt.io/IvyBridge: 'true'
          cpu-model-migration.node.kubevirt.io/IvyBridge-IBRS: 'true'
          cpu-model-migration.node.kubevirt.io/Nehalem: 'true'
          cpu-model-migration.node.kubevirt.io/Nehalem-IBRS: 'true'
          cpu-model-migration.node.kubevirt.io/Penryn: 'true'
          cpu-model-migration.node.kubevirt.io/SandyBridge: 'true'
          cpu-model-migration.node.kubevirt.io/SandyBridge-IBRS: 'true'
          cpu-model-migration.node.kubevirt.io/SierraForest: 'true'
          cpu-model-migration.node.kubevirt.io/Skylake-Client-noTSX-IBRS: 'true'
          cpu-model-migration.node.kubevirt.io/Skylake-Server-noTSX-IBRS: 'true'
          cpu-model-migration.node.kubevirt.io/Westmere: 'true'
          cpu-model-migration.node.kubevirt.io/Westmere-IBRS: 'true'
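
      After the upgrade this can be confirmed cluster-wide with a label selector. The query below is illustrative and simply reuses the label key shown above; an empty result ("No resources found") means no node advertises the model any more:

      # oc get nodes -l 'cpu-model-migration.node.kubevirt.io/SapphireRapids=true'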
      

      When a VM is migrated, the destination virt-launcher pod is created with a node selector pointing to the host-model reported on the source node. Since the host-model was SapphireRapids before the upgrade, any VM that was live migrated before the upgrade has a virt-launcher pod with a SapphireRapids node selector:

       

      # oc get pod virt-launcher-<vm-name>-kz6wq -o yaml | yq '.spec.nodeSelector' | grep migration
      
      cpu-model-migration.node.kubevirt.io/SapphireRapids: 'true'  

       

      When we initiate a migration of these VMs, the target virt-launcher pod fails to schedule because, after the upgrade, none of the nodes carry this label any longer.
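
      The VMs affected in this way can be listed in advance by scanning the running virt-launcher pods for node selectors that reference the now-missing label. The jq filter below is only a sketch built around the label key shown above:

      # oc get pods -A -l kubevirt.io=virt-launcher -o json | jq -r '.items[] | select(.spec.nodeSelector // {} | has("cpu-model-migration.node.kubevirt.io/SapphireRapids")) | .metadata.namespace + "/" + .metadata.name'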

      [1] https://www.intel.com/content/www/us/en/products/sku/237564/intel-xeon-gold-6548y-processor-60m-cache-2-50-ghz/specifications.html
      [2] https://access.redhat.com/errata/RHBA-2025:13666

      Version-Release number of selected component (if applicable):

      OpenShift Virtualization 4.18.13

      How reproducible:

      Observed in customer environment

      Steps to Reproduce:

      The issue should be reproducible with the following steps.
      
      1. Create a 4.18.11 cluster with nodes of the Emerald Rapids CPU family.
      2. Perform a VM live migration so that the target virt-launcher pod is created with a SapphireRapids node selector.
      3. Upgrade the cluster to 4.18.13.
      4. Try live migrating the previously migrated VM after the upgrade. The destination pod will fail to schedule (see the check after these steps).
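
      A minimal check for step 4, assuming standard Kubernetes scheduler events, is to describe the pending target pod and look for a FailedScheduling event that mentions the unsatisfied node selector:

      # oc describe pod virt-launcher-<vm-name>-<hash> -n <vm-namespace> | grep -A 5 Events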

      Actual results:

      VM live migration breaks after upgrading the cluster to 4.18.13 on clusters with nodes of the Emerald Rapids CPU family.

      Expected results:

      VM live migration continues to work after the upgrade, and previously migrated VMs remain schedulable.

      Additional info:

       

              bmordeha@redhat.com Barak Mordehai
              rhn-support-nashok Nijin Ashok
              Denys Shchedrivyi