-
Bug
-
Resolution: Won't Do
-
Major
-
None
-
False
-
False
-
CLOSED
-
CNV Virtualization Sprint 209, CNV Virtualization Sprint 210, CNV Doc Sprint 212
-
High
-
None
CNV cluster with 24+ nodes and 850 virtual machines.
Windows 10 VMs appear to fall offline. When a VM is opened in the UI console, the screen is blank.
In some of the Windows guests, the event log shows:
"Reset to device, \Device\RaidPort2, was issued."
The pods also show:
error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "<sandbox_id>" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container for pod sandbox <sandbox_id>: failed to stop container k8s_compute_virt-launcher-<pod>.virtualmachines_<container_id>: context deadline exceeded"]
This appears to have started after a mass Windows update:
The guests were running Windows 10 with all updates applied.
These patches were then applied to the Windows VMs:
KB5005700
KB5005566
After this, 150 out of 700 VMs became unresponsive and showed the symptoms described above.
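To gauge how many pods show the kill-timeout error quoted above, the event messages could be filtered by that signature. The following is only a minimal sketch: the regular expression, the function names `is_kill_timeout` and `affected_pods`, and the input shape (a mapping of pod name to its event messages, however those were collected) are all assumptions for illustration and are not part of the original report.

```python
import re
from typing import Iterable, List, Mapping

# Hypothetical signature for the error quoted in this report:
# a KillContainer failure on the "compute" container that ends in a
# "context deadline exceeded" timeout.
KILL_TIMEOUT = re.compile(
    r'failed to "KillContainer" for "compute".*context deadline exceeded',
    re.DOTALL,
)


def is_kill_timeout(message: str) -> bool:
    """Return True if an event message matches the kill-timeout signature."""
    return KILL_TIMEOUT.search(message) is not None


def affected_pods(events: Mapping[str, Iterable[str]]) -> List[str]:
    """Return the sorted names of pods with at least one matching message.

    `events` maps a pod name to the event messages recorded for it
    (assumed input shape; how the messages are gathered is out of scope).
    """
    return sorted(
        pod
        for pod, messages in events.items()
        if any(is_kill_timeout(m) for m in messages)
    )
```

Counting the returned names and comparing the result with the 150 affected VMs reported above would show whether the pod error and the guest symptoms track each other.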
Sample Windows VM YAML:
---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: <$VM>
  name: <$VM>
  namespace: virtualmachines
spec:
  dataVolumeTemplates:
  - metadata:
      name: <$VM>
    spec:
      pvc:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 100Gi
        storageClassName: ocs-storagecluster-ceph-rbd
        volumeMode: Block
      source:
        blank: {}
    status: {}
  running: false
  template:
    metadata:
      creationTimestamp: null
      labels:
        kubevirt.io/vm: <$VM>
    spec:
      domain:
        clock:
          timer:
            hpet:
              present: false
            hyperv: {}
            pit:
              tickPolicy: delay
            rtc:
              tickPolicy: catchup
          utc: {}
        cpu:
          cores: 1
          model: host-model
          sockets: 2
        devices:
          disks:
          - bootOrder: 2
            disk:
              bus: virtio
              pciAddress: "0000:00:02.0"
            name: os-disk
          interfaces:
          - bootOrder: 1
            bridge: {}
            macAddress: <$MAC>
            name: vnic0
            pciAddress: "0000:00:03.0"
          networkInterfaceMultiqueue: true
        features:
          acpi: {}
          apic: {}
          hyperv:
            evmcs: {}
            frequencies: {}
            ipi: {}
            reenlightenment: {}
            relaxed: {}
            reset: {}
            runtime: {}
            spinlocks:
              spinlocks: 8191
            synic: {}
            synictimer: {}
            tlbflush: {}
            vapic: {}
            vpindex: {}
        firmware:
          uuid: <$UUID>
        resources:
          requests:
            cpu: 1500m
            memory: 11Gi
      networks:
      - multus:
          networkName: <$VLAN_ID>
        name: vnic0
      terminationGracePeriodSeconds: 30
      evictionStrategy: LiveMigrate
      volumes:
      - dataVolume:
          name: <$VOL_NAME>
        name: os-disk
status: {}