  1. OpenShift Virtualization
  2. CNV-19214

[2099216] VMs fail to start on some specific environment (Icelake)


    • CNV Virtualization Sprint 222, CNV Virtualization Sprint 223, CNV Virtualization Sprint 224, CNV Virtualization Sprint 225
    • Important
    • None

      Description of problem: I tried to create a VM with guest agent with the following spec:
      http://pastebin.test.redhat.com/1059571
      datavolume: http://pastebin.test.redhat.com/1059572
      but I get this error message in the events of the VMI:

      Events:
      Type Reason Age From Message
      ---- ------ ---- ---- -------
      Normal SuccessfulCreate 55m virtualmachine-controller Created virtual machine pod virt-launcher-test-vm-l7fw2
      Normal Created 55m virt-handler VirtualMachineInstance defined.
      Normal Started 55m virt-handler VirtualMachineInstance started.
      Warning SyncFailed 2m9s (x31 over 55m) virt-handler server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, Message='internal error: unable to execute QEMU command 'cont': Resetting the Virtual Machine is required')"
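      To make the SyncFailed event easier to triage, the libvirt error fields can be pulled out of the message text. This is only a minimal sketch: the regular expression and the `parse_libvirt_error` helper are assumptions based on the single message shown above, not a KubeVirt or libvirt API.

```python
import re

# Pattern inferred from the one event message in this report (an assumption).
# The greedy ".*" runs to the last "')" so the inner quotes around 'cont' survive.
LIBVIRT_ERROR = re.compile(
    r"LibvirtError\(Code=(\d+), Domain=(\d+), Message='(.*)'\)"
)


def parse_libvirt_error(event_message):
    """Return code, domain and message from a SyncFailed event, or None."""
    match = LIBVIRT_ERROR.search(event_message)
    if match is None:
        return None
    return {
        "code": int(match.group(1)),
        "domain": int(match.group(2)),
        "message": match.group(3),
    }


# Event text copied from the VMI events above.
EVENT = (
    'server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, '
    "Message='internal error: unable to execute QEMU command 'cont': "
    "Resetting the Virtual Machine is required')\""
)

error = parse_libvirt_error(EVENT)
print(error["code"], error["domain"])
print(error["message"])
```

      In libvirt's error enums, code 1 is VIR_ERR_INTERNAL_ERROR and domain 10 is VIR_FROM_QEMU, which matches the "internal error" text and points at the QEMU driver rather than at KubeVirt itself.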

      I checked the logs (tail -n 200 /var/log/libvirt/qemu/*.log) in the virt-launcher pod and I noticed this error:

      -msg timestamp=on
      KVM: entry failed, hardware error 0x8
      EAX=00000000 EBX=00000000 ECX=00000000 EDX=00080661
      ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
      EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
      ES =0000 00000000 0000ffff 00009300
      CS =f000 ffff0000 0000ffff 00009b00
      SS =0000 00000000 0000ffff 00009300
      DS =0000 00000000 0000ffff 00009300
      FS =0000 00000000 0000ffff 00009300
      GS =0000 00000000 0000ffff 00009300
      LDT=0000 00000000 0000ffff 00008200
      TR =0000 00000000 0000ffff 00008b00
      GDT= 00000000 0000ffff
      IDT= 00000000 0000ffff
      CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
      DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
      DR6=00000000ffff0ff0 DR7=0000000000000400
      EFER=0000000000000000
      Code=04 66 41 eb f1 66 83 c9 ff 66 89 c8 66 5b 66 5e 66 5f 66 c3 <ea> 5b e0 00 f0 30 36 2f 32 33 2f 39 39 00 fc 00 ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??

      I also checked the status of the VM from inside the virt-launcher pod:

      [mperetz@mperetz ~]$ oc rsh virt-launcher-simple-vm-kmnbc
      sh-4.4# virsh list
      Id Name State
      ----------------------------------
      1 default_simple-vm paused

      sh-4.4# exut\
      > ^C
      sh-4.4# exit
      exit
      command terminated with exit code 130
      [mperetz@mperetz ~]$ oc get vmi
      NAME AGE PHASE IP NODENAME READY
      simple-vm 6m59s Running 10.128.2.40 oadp-12290-wqlcn-worker-0-llq8b True
      [mperetz@mperetz ~]$
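      The transcript above shows a mismatch: the VMI reports phase Running and READY True while libvirt reports the domain as paused. A small sketch like the following could flag paused domains from captured `virsh list` output. The `parse_virsh_list` helper is hypothetical, and the sample text is the whitespace-collapsed output pasted above.

```python
def parse_virsh_list(text):
    """Parse `virsh list` output into dicts with id, name and state."""
    domains = []
    for line in text.splitlines():
        # Split into at most three fields so multi-word states stay intact.
        parts = line.split(None, 2)
        # Data rows start with a numeric ID, or "-" for inactive domains;
        # the header and the dashed separator are skipped by this check.
        if len(parts) == 3 and (parts[0].isdigit() or parts[0] == "-"):
            domains.append(
                {"id": parts[0], "name": parts[1], "state": parts[2].strip()}
            )
    return domains


# Output copied from the virt-launcher pod session above.
VIRSH_OUTPUT = """\
Id Name State
----------------------------------
1 default_simple-vm paused
"""

paused = [d["name"] for d in parse_virsh_list(VIRSH_OUTPUT) if d["state"] == "paused"]
print(paused)
```

      Any name printed here while `oc get vmi` still shows Running indicates the same condition described in this report.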

      additional details:
      lscpu of the worker nodes: http://pastebin.test.redhat.com/1059422
      OCP version: 4.10 (OpenStack on PSI). Also tried 4.9.
      OpenStack flavor: ci.m1.xlarge
      lscpu output:
      sh-4.4# lscpu
      Architecture: x86_64
      CPU op-mode(s): 32-bit, 64-bit
      Byte Order: Little Endian
      CPU(s): 8
      On-line CPU(s) list: 0-7
      Thread(s) per core: 1
      Core(s) per socket: 1
      Socket(s): 8
      NUMA node(s): 1
      Vendor ID: GenuineIntel
      BIOS Vendor ID: Red Hat
      CPU family: 6
      Model: 134
      Model name: Intel Xeon Processor (Icelake)
      BIOS Model name: RHEL 7.6.0 PC (i440FX + PIIX, 1996)
      Stepping: 0
      CPU MHz: 2294.608
      BogoMIPS: 4589.21
      Virtualization: VT-x
      Hypervisor vendor: KVM
      Virtualization type: full
      L1d cache: 32K
      L1i cache: 32K
      L2 cache: 4096K
      L3 cache: 16384K
      NUMA node0 CPU(s): 0-7
      Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 md_clear arch_capabilities

      I can't tell from the error message what exactly causes the issue.
      I also tried OCP 4.10 with the same CNV version, but with OpenStack flavor ci.standard.xl, which placed the worker nodes on a different server:

      sh-4.4# lscpu
      Architecture: x86_64
      CPU op-mode(s): 32-bit, 64-bit
      Byte Order: Little Endian
      CPU(s): 8
      On-line CPU(s) list: 0-7
      Thread(s) per core: 1
      Core(s) per socket: 1
      Socket(s): 8
      NUMA node(s): 1
      Vendor ID: GenuineIntel
      BIOS Vendor ID: Red Hat
      CPU family: 6
      Model: 85
      Model name: Intel Xeon Processor (Skylake, IBRS)
      BIOS Model name: RHEL 7.6.0 PC (i440FX + PIIX, 1996)
      Stepping: 4
      CPU MHz: 2095.076
      BogoMIPS: 4190.15
      Virtualization: VT-x
      Hypervisor vendor: KVM
      Virtualization type: full
      L1d cache: 32K
      L1i cache: 32K
      L2 cache: 4096K
      L3 cache: 16384K
      NUMA node0 CPU(s): 0-7
      Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke md_clear arch_capabilities

      and there it works.
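      Since the two environments differ mainly in CPU model, comparing the `Flags:` lines of the two lscpu outputs may help narrow things down. The following is a sketch only: `parse_flags` and `flag_diff` are hypothetical helpers, and the two sample strings are abbreviated excerpts of the flag lists above, not the full lists.

```python
def parse_flags(lscpu_text):
    """Return the set of CPU flags from the 'Flags:' line of lscpu output."""
    for line in lscpu_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key.strip() == "Flags":
            return set(value.split())
    return set()


def flag_diff(failing_text, working_text):
    """Return (flags only on the failing node, flags only on the working node)."""
    failing = parse_flags(failing_text)
    working = parse_flags(working_text)
    return sorted(failing - working), sorted(working - failing)


# Abbreviated excerpts of the two Flags lines in this report (not complete).
ICELAKE = "Model: 134\nFlags: avx2 avx512f avx512ifma sha_ni la57"
SKYLAKE = "Model: 85\nFlags: pti hle avx2 rtm mpx avx512f"

only_icelake, only_skylake = flag_diff(ICELAKE, SKYLAKE)
print("only on Icelake node:", only_icelake)
print("only on Skylake node:", only_skylake)
```

      Running it against the full lscpu dumps would list every flag that differs. Whether any of those flags is related to the failure is not established by this report.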

      Version-Release number of selected component (if applicable): CNV version: 4.9/4.10.2 (production)

      How reproducible: 100% on the specific platform with the Icelake cpu-model

      Steps to Reproduce:
      As mentioned above, I'm not sure what the root cause is, but this is how I reproduce it:
      1. Use the flexy-install job to create an OpenStack cluster on PSI with OCP version 4.10 and flavor ci.m1.xlarge (which usually deploys the worker nodes on a server with the Icelake CPU model).
      2. Deploy the following VM and data volume (it also happened with other templates, such as alpine, so these exact templates are not required):
      http://pastebin.test.redhat.com/1059571
      datavolume: http://pastebin.test.redhat.com/1059572
      3. Check the events of the VMI. This error eventually appears:
      "LibvirtError(Code=1, Domain=10, Message='internal error: unable to execute QEMU command 'cont': Resetting the Virtual Machine is required')"
      4. Check the other logs and statuses mentioned in the problem description.

      Actual results:

      Expected results:

      Additional info:

              jelejosne Jed Lejosne
              mperetz@redhat.com Maya Peretz
              Kedar Bidarkar Kedar Bidarkar