OpenShift Virtualization / CNV-68555

VM live migration breaks after upgrading the cluster to 4.18.13


    • Quality / Stability / Reliability
    • CNV Virt-Cluster Sprint 278
    • Important
    • Customer Reported

      Description of problem:

      The issue is observed in clusters whose nodes have Emerald Rapids family CPUs, where the "host-model" reported by libvirtd changed from SapphireRapids to SierraForest after the upgrade.

      The CPU model of the nodes in this cluster is INTEL(R) XEON(R) GOLD 6548Y+, which belongs to the Emerald Rapids family [1]. Libvirt and QEMU do not define an Emerald Rapids CPU model, so the "host-model" reported by libvirtd before the upgrade was SapphireRapids:

      domcapabilities output from one of the virt-launcher pods that was not upgraded:

          <mode name='host-model' supported='yes'>
            <model fallback='forbid'>SapphireRapids</model>
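
      For reference, the same output can be collected from a running virt-launcher pod. The command below is only a sketch; the "compute" container name and the read-only virsh invocation are assumptions based on a typical KubeVirt deployment:

      # oc exec -n <vm-namespace> virt-launcher-<vm-name>-<hash> -c compute -- virsh -r domcapabilities | grep -A 3 'host-model'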

       

      It also reports usable='no' for the SapphireRapids model:

            <model usable='no' vendor='Intel'>SapphireRapids</model>

       

      virt-handler labels each node with the CPU models reported as usable='yes' as well as the model reported as the host-model. So the SapphireRapids label was previously applied to all the nodes because it was returned as the host-model.
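
      A quick way to see which model virt-handler currently treats as the host model is to inspect the node labels. The host-model-cpu.node.kubevirt.io label prefix used below is an assumption based on KubeVirt's node-labeller conventions:

      # oc get node <node-name> -o json | jq -r '.metadata.labels | keys[] | select(startswith("host-model-cpu.node.kubevirt.io/"))'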

      Support for the SierraForest CPU model was added in libvirt 10.0.0-6.19.el9_4 [2]. OpenShift Virtualization 4.18.11 shipped libvirt 10.0.0-6.17.el9_4.x86_64. After upgrading OpenShift Virtualization to 4.18.13, which includes the version with SierraForest support, libvirt began reporting the host-model as SierraForest instead of SapphireRapids on all the nodes:

          <mode name='host-model' supported='yes'>
            <model fallback='forbid'>SierraForest</model>
            <vendor>Intel</vendor>
      ….
      ….
          <mode name='custom' supported='yes'>
      ….
            <model usable='no' vendor='Intel'>SierraForest</model>
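
      The libvirt version actually running on a node can be confirmed from the corresponding virt-launcher pod. This is a sketch that assumes the compute container image carries the libvirt RPMs:

      # oc exec -n <vm-namespace> virt-launcher-<vm-name>-<hash> -c compute -- rpm -qa | grep libvirt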

       

      As a result, the nodes were no longer labeled with SapphireRapids after the upgrade:

      # oc get node <node-name> -o yaml |grep " cpu-model-migration"
          cpu-model-migration.node.kubevirt.io/Broadwell-noTSX: 'true'
          cpu-model-migration.node.kubevirt.io/Broadwell-noTSX-IBRS: 'true'
          cpu-model-migration.node.kubevirt.io/Cascadelake-Server-noTSX: 'true'
          cpu-model-migration.node.kubevirt.io/Haswell-noTSX: 'true'
          cpu-model-migration.node.kubevirt.io/Haswell-noTSX-IBRS: 'true'
          cpu-model-migration.node.kubevirt.io/Icelake-Server-noTSX: 'true'
          cpu-model-migration.node.kubevirt.io/IvyBridge: 'true'
          cpu-model-migration.node.kubevirt.io/IvyBridge-IBRS: 'true'
          cpu-model-migration.node.kubevirt.io/Nehalem: 'true'
          cpu-model-migration.node.kubevirt.io/Nehalem-IBRS: 'true'
          cpu-model-migration.node.kubevirt.io/Penryn: 'true'
          cpu-model-migration.node.kubevirt.io/SandyBridge: 'true'
          cpu-model-migration.node.kubevirt.io/SandyBridge-IBRS: 'true'
          cpu-model-migration.node.kubevirt.io/SierraForest: 'true'
          cpu-model-migration.node.kubevirt.io/Skylake-Client-noTSX-IBRS: 'true'
          cpu-model-migration.node.kubevirt.io/Skylake-Server-noTSX-IBRS: 'true'
          cpu-model-migration.node.kubevirt.io/Westmere: 'true'
          cpu-model-migration.node.kubevirt.io/Westmere-IBRS: 'true'
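
      After the upgrade this can be confirmed cluster-wide with a label selector. The query below is illustrative and simply reuses the label key shown above; an empty result ("No resources found") means no node advertises the model any more:

      # oc get nodes -l 'cpu-model-migration.node.kubevirt.io/SapphireRapids=true'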
      

      When a VM is migrated, the destination virt-launcher pod is created with a node selector pointing to the host-model reported on the source node. Since the host-model was SapphireRapids before the upgrade, any VM that was live migrated before the upgrade has a virt-launcher pod with a SapphireRapids node selector:

       

      # oc get pod virt-launcher-<vm-name>-kz6wq -o yaml | yq '.spec.nodeSelector' | grep migration
      
      cpu-model-migration.node.kubevirt.io/SapphireRapids: 'true'  

       

      When we initiate a migration of these VMs, the target virt-launcher pod fails to schedule because, after the upgrade, none of the nodes carry this label any longer.
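
      The VMs affected in this way can be listed in advance by scanning the running virt-launcher pods for node selectors that reference the now-missing label. The jq filter below is only a sketch built around the label key shown above:

      # oc get pods -A -l kubevirt.io=virt-launcher -o json | jq -r '.items[] | select(.spec.nodeSelector // {} | has("cpu-model-migration.node.kubevirt.io/SapphireRapids")) | .metadata.namespace + "/" + .metadata.name'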

      [1] https://www.intel.com/content/www/us/en/products/sku/237564/intel-xeon-gold-6548y-processor-60m-cache-2-50-ghz/specifications.html
      [2] https://access.redhat.com/errata/RHBA-2025:13666

      Version-Release number of selected component (if applicable):

      OpenShift Virtualization 4.18.13

      How reproducible:

      Observed in customer environment

      Steps to Reproduce:

      The issue should be reproducible with the following steps.
      
      1. Create a 4.18.11 cluster with nodes of the Emerald Rapids CPU family.
      2. Perform a VM live migration so that the target virt-launcher pod is created with a SapphireRapids node selector.
      3. Upgrade the cluster to 4.18.13.
      4. Try live migrating the previously migrated VM after the upgrade. The destination pod will fail to schedule (see the check after these steps).
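
      A minimal check for step 4, assuming standard Kubernetes scheduler events, is to describe the pending target pod and look for a FailedScheduling event that mentions the unsatisfied node selector:

      # oc describe pod virt-launcher-<vm-name>-<hash> -n <vm-namespace> | grep -A 5 Events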

      Actual results:

      VM live migration breaks after upgrading the cluster to 4.18.13 on clusters with nodes of the Emerald Rapids CPU family.

      Expected results:

      VM live migration continues to work after the upgrade, and previously migrated VMs remain schedulable.

      Additional info:

       

              bmordeha@redhat.com Barak Mordehai
              rhn-support-nashok Nijin Ashok
              Denys Shchedrivyi