OpenShift Bugs / OCPBUGS-60158

Virtual machine startup is stuck with error: "Failed to discover live migration status: unexpected live migration state at pods"


    • Quality / Stability / Reliability
    • Important

      Description of problem:

      In the live migration status check, the virt-launcher pods of a VM are identified by looking for pods with the label vm.kubevirt.io/name: <vm-name>. OpenShift Virtualization sets this label on the virt-launcher pod automatically, and its value defaults to the VM name. However, the label value can be overridden by setting ".spec.template.spec.hostname" in the VM spec. See sanitize.go#L33 and template.go#L1572.

      For example, the following VM has the hostname 'test-vm', so the virt-launcher pod is automatically labeled with 'vm.kubevirt.io/name: test-vm':

      # oc get vm rhel9-azure-earthworm-57 -o json |jq '.spec.template.spec.hostname'
      
      "test-vm"
      
      
      # oc get pod virt-launcher-rhel9-azure-earthworm-57-6ddr6 -o json|jq '.metadata.labels'
      {
        "kubevirt.io": "virt-launcher",
        "kubevirt.io/created-by": "ffc3ae7a-e048-4032-bddd-6399fb1ec5af",
        "kubevirt.io/domain": "rhel9-azure-earthworm-57",
        "kubevirt.io/nodeName": "10.74.128.230",
        "kubevirt.io/size": "small",
        "network.kubevirt.io/headlessService": "headless",
        "vm.kubevirt.io/name": "test-vm"                     <===
      }
      
      
      

      So the label value is not guaranteed to be unique across the virtual machines in a namespace.

      If another VM in the namespace has the same spec.hostname, OVN incorrectly finds that VM's virt-launcher pod while looking for "livingPods". When more than two VMs share the same spec.hostname, the third VM fails to start with tooManyPodsError at kubevirt/pod.go#L494.
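      The collision described above can be illustrated with a small model. This is only a sketch based on the behavior reported here, not the ovn-kubernetes implementation: the Pod class, the function names, the TooManyPodsError class, and the limit of two pods per VM are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

VM_NAME_LABEL = "vm.kubevirt.io/name"


@dataclass
class Pod:
    """Minimal stand-in for a virt-launcher pod (hypothetical type)."""
    name: str
    labels: dict = field(default_factory=dict)


class TooManyPodsError(Exception):
    """Hypothetical stand-in for the tooManyPodsError named in this report."""


def find_vm_pods(pods, label_value):
    # Select pods purely by label value, the way a label selector would.
    return [p for p in pods if p.labels.get(VM_NAME_LABEL) == label_value]


def check_live_migration_pods(pods, label_value, max_pods=2):
    # Assumption: at most two pods are expected per VM (for example a
    # migration source and target), inferred from the third VM failing.
    matched = find_vm_pods(pods, label_value)
    if len(matched) > max_pods:
        names = ",".join(p.name for p in matched)
        raise TooManyPodsError(
            "unexpected live migration state at pods: " + names
        )
    return matched
```

      With three pods from three different VMs that all carry 'vm.kubevirt.io/name: test-vm', find_vm_pods returns all three, and the check fails even though no migration is in progress.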

      3 VMs with the same hostname:

      # oc get vm -o json | jq -r '.items[] | "\(.metadata.name) \(.spec.template.spec.hostname)"'
      
      rhel9-azure-earthworm-57 test-vm
      rhel9-beige-jellyfish-19 test-vm
      rhel9-coral-anteater-76 test-vm

      Of these, two are running and the third is stuck in the Starting state:

      NAME                       AGE    STATUS     READY
      rhel9-azure-earthworm-57   56m    Running    True
      rhel9-beige-jellyfish-19   57m    Running    True
      rhel9-coral-anteater-76    57m    Starting   False
      
      
      

      The ovnkube-controller logs the following error:

      I0805 18:29:51.270393    6978 base_network_controller_pods.go:477] [default/nijin-cnv/virt-launcher-rhel9-coral-anteater-76-7qc9m] creating logical port nijin-cnv_virt-launcher-rhel9-coral-anteater-76-7qc9m for pod on switch 18-66-da-9f-a6-de
      
      E0805 18:29:51.285387    6978 obj_retry.go:684] Failed to update *v1.Pod, old=nijin-cnv/virt-launcher-rhel9-coral-anteater-76-7qc9m, new=nijin-cnv/virt-launcher-rhel9-coral-anteater-76-7qc9m, error: failed to discover Live-migration status: unexpected live migration state at pods: nijin-cnv/virt-launcher-rhel9-azure-earthworm-57-6ddr6,nijin-cnv/virt-launcher-rhel9-beige-jellyfish-19-w9ps2,nijin-cnv/virt-launcher-rhel9-coral-anteater-76-7qc9m

      The error points to the other two pods, which have the same hostname.

      Version-Release number of selected component (if applicable):

      OpenShift 4.18.11

      OpenShift Virtualization 4.18.11

       

      How reproducible:

      100%

      Steps to Reproduce:

      1. Create 3 VMs with the same "spec.template.spec.hostname" value.

      2. Attach a localnet secondary network to all the VMs.

      3. Start all 3 VMs. One of them will be stuck in the "Starting" state with the above error.

      Actual results:

      Virtual machine startup is stuck with error: "Failed to discover live migration status: unexpected live migration state at pods"

      Expected results:

      The vm.kubevirt.io/name: <vm-name> label, which is used to find the virt-launcher pods, is not guaranteed to be unique in the namespace, and this can cause unexpected results. This probably needs additional validation.
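      As one possible shape for such a check, the sketch below scans the parsed output of `oc get vm -o json` and reports VMs whose effective label value collides within a namespace. The function names are hypothetical, and the rule "hostname if set, otherwise the VM name" is a simplification of the behavior described in this report; any sanitizing that KubeVirt applies to the value is not modeled.

```python
from collections import defaultdict


def effective_label_value(vm):
    # Assumption: the label defaults to the VM name and is overridden
    # by .spec.template.spec.hostname when that field is set.
    hostname = (
        vm.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("hostname")
    )
    return hostname or vm["metadata"]["name"]


def find_label_collisions(vm_list):
    """Return {(namespace, label_value): [vm names]} for shared values."""
    groups = defaultdict(list)
    for vm in vm_list.get("items", []):
        namespace = vm["metadata"].get("namespace", "")
        key = (namespace, effective_label_value(vm))
        groups[key].append(vm["metadata"]["name"])
    # Keep only label values used by more than one VM.
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}
```

      The input can be produced with json.load on the saved output of `oc get vm -o json`; an empty result means no two VMs in the same namespace share a label value under these assumptions.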

              phoracek@redhat.com Petr Horacek
              rhn-support-nashok Nijin Ashok
              Anurag Saxena