Type: Bug
Resolution: Not a Bug
After live migrating a VM to a different node, traffic to and from it becomes permanently broken.
Reproduce the issue with the following manifests:
---
apiVersion: k8s.ovn.org/v1
kind: VTEP
metadata:
  name: evpn-vtep
spec:
  mode: Unmanaged
  cidrs:
  - 192.168.122.0/24 # must adjust this to match the node subnet
---
apiVersion: v1
kind: Namespace
metadata:
  name: evpn-demo
  labels:
    network: evpn-demo
    k8s.ovn.org/primary-user-defined-network: evpn-l2
---
apiVersion: k8s.ovn.org/v1
kind: ClusterUserDefinedNetwork
metadata:
  name: evpn-l2
  labels:
    evpn: "true"
spec:
  namespaceSelector:
    matchLabels:
      network: evpn-demo
  network:
    topology: Layer2
    transport: EVPN
    layer2:
      role: Primary
      subnets:
      - 10.200.0.0/16
      ipam:
        lifecycle: Persistent # VMs need this
      evpn:
        vtep: evpn-vtep
        macVRF:
          vni: 20100
          routeTarget: "65000:20100"
        ipVRF:
          vni: 20101
          routeTarget: "65000:20101"
---
apiVersion: k8s.ovn.org/v1
kind: RouteAdvertisements
metadata:
  name: evpn-routes
spec:
  nodeSelector: {}
  frrConfigurationSelector: {}
  networkSelectors:
  - networkSelectionType: ClusterUserDefinedNetworks
    clusterUserDefinedNetworkSelector:
      networkSelector:
        matchLabels:
          evpn: "true"
  targetVRF: auto
  advertisements:
  - PodNetwork
Now provision the workloads. First the pods:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nettools
  namespace: evpn-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nettools
  template:
    metadata:
      labels:
        app: nettools
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: nettools
            topologyKey: kubernetes.io/hostname
      containers:
      - name: nettools
        image: docker.io/nicolaka/netshoot:v0.13
        command: ["sleep", "infinity"]
And the VM:
---
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: vm1
  name: vm1
  namespace: evpn-demo
spec:
  runStrategy: Always
  template:
    metadata:
      name: vm1
      namespace: evpn-demo
      labels:
        app.kubernetes.io/name: evpn-demo
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: containerdisk
          - disk:
              bus: virtio
            name: cloudinitdisk
          interfaces:
          - name: evpn
            binding:
              name: l2bridge
          rng: {}
        resources:
          requests:
            memory: 2048M
      networks:
      - pod: {}
        name: evpn
      terminationGracePeriodSeconds: 0
      volumes:
      - containerDisk:
          image: quay.io/kubevirt/fedora-with-test-tooling-container-disk:v1.7.0
        name: containerdisk
      - cloudInitNoCloud:
          userData: |-
            #cloud-config
            password: fedora
            chpasswd: { expire: False }
        name: cloudinitdisk
Check the IP address of the VM and start a ping from one of the pods to it.
Then issue the migrate command:
virtctl migrate -n evpn-demo vm1
Traffic will stop, and the following errors will be seen in the pod:
From 10.200.0.6 icmp_seq=988 Destination Host Unreachable
From 10.200.0.6 icmp_seq=989 Destination Host Unreachable
From 10.200.0.6 icmp_seq=990 Destination Host Unreachable
From 10.200.0.6 icmp_seq=991 Destination Host Unreachable
From 10.200.0.6 icmp_seq=992 Destination Host Unreachable
The FDB tables on the source and destination worker nodes look as follows.
Before migration:
# SRC node (where the VM is running)
sh-5.1# bridge fdb | grep 0a:58:0a:c8:00:0d
0a:58:0a:c8:00:0d dev evpn-evpn-l2 vlan 4 master evbr-evpn-vtep

# DST node
sh-5.1# bridge fdb | grep 0a:58:0a:c8:00:0d
0a:58:0a:c8:00:0d dev evx4-evpn-vtep vlan 4 extern_learn master evbr-evpn-vtep
0a:58:0a:c8:00:0d dev evx4-evpn-vtep dst 192.168.122.37 src_vni 20100 self extern_learn
After migration:
# SRC node
sh-5.1# bridge fdb | grep 0a:58:0a:c8:00:0d
0a:58:0a:c8:00:0d dev evpn-evpn-l2 vlan 4 master evbr-evpn-vtep

# DST node (where the VM is running)
sh-5.1# bridge fdb | grep 0a:58:0a:c8:00:0d
0a:58:0a:c8:00:0d dev evx4-evpn-vtep vlan 4 extern_learn master evbr-evpn-vtep
0a:58:0a:c8:00:0d dev evx4-evpn-vtep dst 192.168.122.37 src_vni 20100 self extern_learn
After a while, the FDB entries time out and are garbage collected.
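The broken state is visible from the FDB output alone: a node hosting the VM should hold a locally learned entry for its MAC (like the `dev evpn-evpn-l2` entry on the source node), whereas a node that only holds `extern_learn` entries still forwards that MAC toward a remote VTEP. A minimal sketch that classifies captured `bridge fdb` lines this way (the script and its heuristic are illustrative, not an ovn-kubernetes tool):

```shell
#!/bin/sh
# Classify captured `bridge fdb` output for one MAC address.
# Heuristic (assumption, for illustration): entries without `extern_learn`
# were learned locally; a node with none of those forwards the MAC remotely.
MAC="0a:58:0a:c8:00:0d"
# DST node's FDB after migration, as captured above
FDB='0a:58:0a:c8:00:0d dev evx4-evpn-vtep vlan 4 extern_learn master evbr-evpn-vtep
0a:58:0a:c8:00:0d dev evx4-evpn-vtep dst 192.168.122.37 src_vni 20100 self extern_learn'

# Count locally learned entries for the MAC (i.e. lines without extern_learn)
local_entries=$(printf '%s\n' "$FDB" | \
  awk -v mac="$MAC" '$1 == mac && $0 !~ /extern_learn/ {n++} END {print n+0}')

if [ "$local_entries" -eq 0 ]; then
  verdict="stale: $MAC known only via remote VTEP (extern_learn) entries"
else
  verdict="ok: $MAC has a locally learned entry"
fi
echo "$verdict"
```

Run against the destination node's post-migration capture this reports the stale case, matching the symptom above: the new host keeps pointing at the old VTEP until the remote entries age out.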
Relates to: CORENET-6853 (kubevirt traffic disrupted during live migration with EVPN at target LSP)