OpenShift Bugs / OCPBUGS-57594

Calico CNI breaks node connectivity during live migration of Kubevirt VMs

      Description of problem:

      As seen in this run, the following test fails:

      [sig-kubevirt] migration when running openshift cluster on KubeVirt virtual machines and live migrate hosted control plane workers [Early] should maintain node readiness [Suite:openshift/conformance/parallel]

      The test fails with one node losing readiness:

      {  fail [github.com/openshift/origin/test/extended/kubevirt/migration.go:83]: Failed after 45.771s.
      nodes should maintain ready state
      Expected
          <int>: 2
      to equal
          <int>: 3
      Ginkgo exit error 1: exit with code 1} 
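      The assertion compares the number of Ready nodes with the expected count, so "Expected 2 to equal 3" means one of the three workers dropped out of the Ready state. The following is a minimal, stdlib-only Go sketch of that kind of check, not the code in migration.go: the Node type and the worker names are hypothetical, and the actual test presumably reads Kubernetes Node conditions through the API rather than a static slice.

```go
package main

import "fmt"

// Node is a hypothetical, trimmed-down stand-in for a Kubernetes Node:
// only its name and whether its "Ready" condition is True are modeled.
type Node struct {
	Name  string
	Ready bool
}

// countReady returns how many nodes currently report Ready.
func countReady(nodes []Node) int {
	n := 0
	for _, node := range nodes {
		if node.Ready {
			n++
		}
	}
	return n
}

func main() {
	// Illustrative values (hypothetical names): one worker VM lost
	// connectivity after live migration and is no longer Ready.
	nodes := []Node{
		{Name: "worker-0", Ready: true},
		{Name: "worker-1", Ready: true},
		{Name: "worker-2", Ready: false},
	}
	expected := len(nodes)
	if got := countReady(nodes); got != expected {
		fmt.Printf("nodes should maintain ready state: expected %d to equal %d\n", got, expected)
	}
}
```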

      The live migration issue was discussed in this Slack thread.

      Summary of the issue:

      * Mother node-A gets the 10.10.10.0/26 IP block.
      * A KubeVirt VM lands on mother node-A, and both the virt-launcher pod and the VM inside it get an IP from this block (10.10.10.1).
      * Everything works at this point: mother node-A advertises the /26 block to the upstream router.
      * The KubeVirt VM live-migrates to mother node-B.
      * At this point the virt-launcher pod gets a new IP address (e.g. 10.10.10.128), but the VM inside retains its IP address of 10.10.10.1.
      * Mother node-B does not advertise the inner VM's address because it only sees the launcher pod.
      * The upstream router still only has a route for 10.10.10.0/26 pointing to mother node-A, so networking to the migrated KubeVirt VM is lost.

      Version-Release number of selected component (if applicable):

          OCP 4.19, Calico 3.29.3

      How reproducible:

          Always

      Steps to Reproduce:

          1. Run the conformance test suite on https://github.com/openshift/release/pull/64789 (this will become a periodic job)

      Actual results:

          The live migration test fails

      Expected results:

          The live migration test passes

      Additional info:

      Calico plans:

      Calico needs to implement the following to support live migration:
      - Detect that a new VM pod is the target of a live migration.
      - Assign the source VM's original IP to the target VM.
      - Keep advertising the original /32 route to the source VM for the duration of the live migration.
      - Detect that the migration is complete.
      - Switch over to advertising the new route.
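      The five steps above can be read as a small per-VM state machine. The following Go sketch is hypothetical and is not Calico's design or API: the Migration type, its methods, and the phases are illustrative names, and how a target pod is actually detected (step 1) and how completion is observed (step 4) are left abstract as plain method calls.

```go
package main

import "fmt"

// Phase is a hypothetical migration phase, as tracked by a routing agent.
type Phase int

const (
	NoMigration Phase = iota // VM runs on a single node
	Migrating                // target pod exists, source VM still running
	Completed                // VM now runs on the target node
)

// Advertisement is a hypothetical /32 route announcement for the VM IP.
type Advertisement struct {
	Prefix string
	Node   string
}

// Migration is hypothetical per-VM state for one live migration.
type Migration struct {
	VMIP       string // original IP of the source VM, reused by the target
	SourceNode string
	TargetNode string
	Phase      Phase
}

// OnTargetPod covers steps 1-2: a pod recognized as a migration target
// reuses the source VM's IP instead of getting a new one from its node's block.
func (m *Migration) OnTargetPod(targetNode string) string {
	m.TargetNode = targetNode
	m.Phase = Migrating
	return m.VMIP
}

// OnCompleted covers step 4: the migration has been observed as finished.
func (m *Migration) OnCompleted() { m.Phase = Completed }

// routeFor returns the /32 to advertise: it stays on the source node until
// completion (step 3), then switches to the target node (step 5).
func (m *Migration) routeFor() Advertisement {
	node := m.SourceNode
	if m.Phase == Completed {
		node = m.TargetNode
	}
	return Advertisement{Prefix: m.VMIP + "/32", Node: node}
}

func main() {
	m := Migration{VMIP: "10.10.10.1", SourceNode: "node-A"}
	fmt.Println("target pod IP:", m.OnTargetPod("node-B")) // 10.10.10.1
	fmt.Println(m.routeFor())                              // {10.10.10.1/32 node-A}
	m.OnCompleted()
	fmt.Println(m.routeFor()) // {10.10.10.1/32 node-B}
}
```

Keeping the /32 on the source node until completion matches the plan above: while the migration is in progress the VM is still running on the source, so traffic must keep reaching it there.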

              Krzysztof Majcher (kmajcher@redhat.com)
              Martin Gencur (mgencur@redhat.com)
              Yu Li