OpenShift Core Networking / CORENET-6849

Traffic is broken after live migrating a VM

    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Critical
    • Sprint: CORENET Sprint 284

      After live migrating a VM to a different node, traffic to/from it becomes permanently broken.

      Reproduce the issue with the following manifests:

      ---
      apiVersion: k8s.ovn.org/v1
      kind: VTEP
      metadata:
        name: evpn-vtep
      spec:
        mode: Unmanaged
        cidrs:
          - 192.168.122.0/24 # must adjust this to match the node subnet
      ---
      apiVersion: v1
      kind: Namespace
      metadata:
        name: evpn-demo
        labels:
          network: evpn-demo
          k8s.ovn.org/primary-user-defined-network: evpn-l2
      ---
      apiVersion: k8s.ovn.org/v1
      kind: ClusterUserDefinedNetwork
      metadata:
        name: evpn-l2
        labels:
          evpn: "true"
      spec:
        namespaceSelector:
          matchLabels:
            network: evpn-demo
        network:
          topology: Layer2
          transport: EVPN
          layer2:
            role: Primary
            subnets:
              - 10.200.0.0/16
            ipam:
              lifecycle: Persistent # VMs need this
          evpn:
            vtep: evpn-vtep
            macVRF:
              vni: 20100
              routeTarget: "65000:20100"
            ipVRF:
              vni: 20101
              routeTarget: "65000:20101"
      ---
      apiVersion: k8s.ovn.org/v1
      kind: RouteAdvertisements
      metadata:
        name: evpn-routes
      spec:
        nodeSelector: {}
        frrConfigurationSelector: {}
        networkSelectors:
          - networkSelectionType: ClusterUserDefinedNetworks
            clusterUserDefinedNetworkSelector:
              networkSelector:
                matchLabels:
                  evpn: "true"
        targetVRF: auto
        advertisements:
          - PodNetwork
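
      Assuming the manifests above are saved to a single file (the name evpn-setup.yaml is illustrative), they can be applied in one step:

      ```shell
      # Apply the VTEP, namespace, ClusterUserDefinedNetwork, and
      # RouteAdvertisements manifests. The file name is an assumption;
      # substitute whatever path you saved them to.
      oc apply -f evpn-setup.yaml
      ```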
      

      Now provision the workloads. First, the pods:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: nettools
        namespace: evpn-demo
      spec:
        replicas: 2
        selector:
          matchLabels:
            app: nettools
        template:
          metadata:
            labels:
              app: nettools
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  - labelSelector:
                      matchLabels:
                        app: nettools
                    topologyKey: kubernetes.io/hostname
            containers:
              - name: nettools
                image: docker.io/nicolaka/netshoot:v0.13
                command: ["sleep", "infinity"]
      

      And the VM:

      ---
      apiVersion: kubevirt.io/v1
      kind: VirtualMachine
      metadata:
        labels:
          kubevirt.io/vm: vm1
        name: vm1
        namespace: evpn-demo 
      spec:
        runStrategy: Always
        template:
          metadata:
            name: vm1
            namespace: evpn-demo
            labels:
              app.kubernetes.io/name: evpn-demo
          spec:
            domain:
              devices:
                disks:
                - disk:
                    bus: virtio
                  name: containerdisk
                - disk:
                    bus: virtio
                  name: cloudinitdisk
                interfaces:
                - name: evpn
                  binding:
                    name: l2bridge
                rng: {}
              resources:
                requests:
                  memory: 2048M
            networks:
            - pod: {}
              name: evpn
            terminationGracePeriodSeconds: 0
            volumes:
            - containerDisk:
                image: quay.io/kubevirt/fedora-with-test-tooling-container-disk:v1.7.0
              name: containerdisk
            - cloudInitNoCloud:
                userData: |-
                  #cloud-config
                  password: fedora
                  chpasswd: { expire: False }
              name: cloudinitdisk 

      Check the IP address of the VM, then start a ping from one of the pods to the VM.
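
      For example (a sketch using the resource names from the manifests above; the VM address shown is illustrative and will differ in your cluster):

      ```shell
      # Read the VM's IP address from the VirtualMachineInstance status.
      oc get vmi vm1 -n evpn-demo -o jsonpath='{.status.interfaces[0].ipAddress}'

      # Start a continuous ping from one of the nettools pods to the VM.
      # Replace 10.200.0.13 with the address reported above.
      oc exec -n evpn-demo deploy/nettools -- ping 10.200.0.13
      ```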

       

      Then issue the migrate command:

      virtctl migrate -n evpn-demo vm1

      Traffic will stop, and the following errors will be seen in the pod:

      From 10.200.0.6 icmp_seq=988 Destination Host Unreachable
      From 10.200.0.6 icmp_seq=989 Destination Host Unreachable
      From 10.200.0.6 icmp_seq=990 Destination Host Unreachable
      From 10.200.0.6 icmp_seq=991 Destination Host Unreachable
      From 10.200.0.6 icmp_seq=992 Destination Host Unreachable
      

      The FDB tables on the worker nodes the VM migrated from and to, filtered by the VM's MAC address, are:

      Before migration:

      # SRC node (where the VM is running)
      sh-5.1# bridge fdb | grep 0a:58:0a:c8:00:0d
      0a:58:0a:c8:00:0d dev evpn-evpn-l2 vlan 4 master evbr-evpn-vtep 
      
      # DST node
      sh-5.1# bridge fdb | grep 0a:58:0a:c8:00:0d
      0a:58:0a:c8:00:0d dev evx4-evpn-vtep vlan 4 extern_learn master evbr-evpn-vtep 
      0a:58:0a:c8:00:0d dev evx4-evpn-vtep dst 192.168.122.37 src_vni 20100 self extern_learn 

      After migration:

      # SRC node
      sh-5.1# bridge fdb | grep 0a:58:0a:c8:00:0d
      0a:58:0a:c8:00:0d dev evpn-evpn-l2 vlan 4 master evbr-evpn-vtep
        
      # DST node (where the VM is running)
      sh-5.1# bridge fdb | grep 0a:58:0a:c8:00:0d
      0a:58:0a:c8:00:0d dev evx4-evpn-vtep vlan 4 extern_learn master evbr-evpn-vtep
      0a:58:0a:c8:00:0d dev evx4-evpn-vtep dst 192.168.122.37 src_vni 20100 self extern_learn

      After a while, the FDB entries time out and are garbage collected.
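
      The aging can be observed with the statistics flag of bridge, which prints per-entry timer information (run on the affected worker node; the MAC address is the one from the outputs above):

      ```shell
      # -s prints timer/statistics info for each FDB entry, making it
      # possible to watch the stale entry age out after the migration.
      bridge -s fdb show | grep 0a:58:0a:c8:00:0d
      ```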

       

              Assignee: Unassigned
              Reporter: Miguel Duarte de Mora Barroso (mduarted@redhat.com)
              Votes: 0
              Watchers: 3