Bug
Resolution: Not a Bug
Undefined
None
None
Quality / Stability / Reliability
0.42
False
False
No
Description of problem:
A TCP connection to a VM is sometimes broken after live migration when using RHCOS images with ovn-kubernetes and bridge binding. Looking at the service's EndpointSlices after live migration, we see the following transition:
old:
- addresses:
  - 10.244.1.7
  conditions:
    ready: true
    serving: true
    terminating: false
  nodeName: ovn-worker
  targetRef:
    kind: Pod
    name: virt-launcher-worker1-ndqjc
    namespace: kv-live-migration-1994
    uid: 73606e39-4b86-4af4-a072-84ad308cf490
- addresses:
  - 10.244.1.7
  conditions:
    ready: false
    serving: false
    terminating: false
  nodeName: ovn-worker2
  targetRef:
    kind: Pod
    name: virt-launcher-worker1-bpm95
    namespace: kv-live-migration-1994
    uid: e8e2aaa1-5814-406d-9b48-1327398a4b5c
new:
- addresses:
  - 10.244.1.7
  conditions:
    ready: false
    serving: false
    terminating: false
  nodeName: ovn-worker
  targetRef:
    kind: Pod
    name: virt-launcher-worker1-ndqjc
    namespace: kv-live-migration-1994
    uid: 73606e39-4b86-4af4-a072-84ad308cf490
- addresses:
  - 10.244.1.7
  conditions:
    ready: false
    serving: false
    terminating: false
  nodeName: ovn-worker2
  targetRef:
    kind: Pod
    name: virt-launcher-worker1-bpm95
    namespace: kv-live-migration-1994
    uid: e8e2aaa1-5814-406d-9b48-1327398a4b5c
Because ovn-kubernetes detects that neither endpoint is ready, it removes the OVN/OVS network infrastructure, and that breaks the connection.
The virt-launcher target pod status is:
phase: Running
conditions:
- lastProbeTime: null
  lastTransitionTime: "2024-02-21T09:00:02Z"
  status: "True"
  type: Initialized
- lastProbeTime: null
  lastTransitionTime: "2024-02-21T08:59:40Z"
  message: corresponding condition of pod readiness gate "kubevirt.io/virtual-machine-unpaused" does not exist.
  reason: ReadinessGatesNotReady
  status: "False"
  type: Ready
- lastProbeTime: null
  lastTransitionTime: "2024-02-21T09:00:09Z"
  status: "True"
  type: ContainersReady
- lastProbeTime: null
  lastTransitionTime: "2024-02-21T08:59:40Z"
  status: "True"
  type: PodScheduled
- lastProbeTime: "2024-02-21T09:00:29Z"
  lastTransitionTime: "2024-02-21T09:00:29Z"
  message: the virtual machine is not paused
  reason: NotPaused
  status: "True"
  type: kubevirt.io/virtual-machine-unpaused
The readiness gate mechanism still has to catch up with the conditions (Ready is still False with reason ReadinessGatesNotReady), but the pod phase is already Running.
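The effect of the old/new transition above can be sketched in Go. `EndpointConditions` and `Endpoint` are local, simplified stand-ins for the EndpointSlice `discovery/v1` fields shown above, and `readyBackends` is a hypothetical consumer that programs load-balancer backends only from ready endpoints. This is an assumption meant to illustrate the failure mode, not ovn-kubernetes' actual code. With the "new" slice, such a consumer is left with no backends at all.

```go
package main

import "fmt"

// EndpointConditions is a local copy of the three EndpointSlice endpoint
// conditions quoted in the report (illustration only).
type EndpointConditions struct {
	Ready       bool
	Serving     bool
	Terminating bool
}

// Endpoint is a trimmed-down, hypothetical view of one EndpointSlice entry.
type Endpoint struct {
	Address  string
	NodeName string
	Pod      string
	Conds    EndpointConditions
}

// readyBackends returns the addresses a consumer would program as backends
// if it only accepts endpoints whose Ready condition is true.
func readyBackends(eps []Endpoint) []string {
	var out []string
	for _, ep := range eps {
		if ep.Conds.Ready {
			out = append(out, ep.Address)
		}
	}
	return out
}

func main() {
	// Source pod ready, target pod not yet ready ("old").
	before := []Endpoint{
		{"10.244.1.7", "ovn-worker", "virt-launcher-worker1-ndqjc", EndpointConditions{Ready: true, Serving: true}},
		{"10.244.1.7", "ovn-worker2", "virt-launcher-worker1-bpm95", EndpointConditions{}},
	}
	// Source pod no longer ready, target pod still not ready ("new").
	after := []Endpoint{
		{"10.244.1.7", "ovn-worker", "virt-launcher-worker1-ndqjc", EndpointConditions{}},
		{"10.244.1.7", "ovn-worker2", "virt-launcher-worker1-bpm95", EndpointConditions{}},
	}
	fmt.Println("before:", readyBackends(before)) // before: [10.244.1.7]
	fmt.Println("after:", readyBackends(after))   // after: []
}
```

The gap lasts until the target pod's Ready condition turns True, since an EndpointSlice endpoint's `ready` condition follows the pod's Ready condition.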
Version-Release number of selected component (if applicable):
How reproducible:
3/12 live migrations
Steps to Reproduce:
1. Create the following VM with a TCP server running on it:
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: worker2
  annotations:
    kubevirt.io/allow-pod-bridge-network-live-migration: ""
spec:
  nodeSelector:
    hypershift: "true"
  architecture: amd64
  domain:
    devices:
      disks:
      - disk:
          bus: virtio
        name: containerdisk
      - disk:
          bus: virtio
        name: cloudinitdisk
      interfaces:
      - bridge: {}
        name: pod
      rng: {}
    machine:
      type: q35
    resources:
      requests:
        memory: 512Mi
  networks:
  - pod: {}
    name: pod
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  terminationGracePeriodSeconds: 5
  volumes:
  - containerDisk:
      image: quay.io/fedora/fedora-coreos-kubevirt:stable
    name: containerdisk
  - cloudInitConfigDrive:
      userData: '{"ignition":{"version":"3.3.0"},"passwd":{"users":[{"name":"core","passwordHash":"$y$j9T$b7RFf2LW7MUOiF4RyLHKA0$T.Ap/uzmg8zrTcUNXyXvBvT26UgkC6zZUVg3UKXeEp5"}]},"storage":{"files":[{"path":"/etc/nmstate/001-dual-stack-dhcp.yml","contents":{"compression":"gzip","source":"data:;base64,H4sIAAAAAAAC/4zKQQrCMBCF4f2c4l1AUBAXc5sxfaGBOh2SScHbiy5cd/n9/M2TvVrhULnA7UUFPW7jKkC+48tc2Z0pwEhLKmYI0OK4qwAA3Z4bF0X2yV9Z1hJ/tjgep0bAZu5l96qotg3KJwAA//+PTU/JngAAAA=="}},{"path":"/etc/nmstate/002-dual-sack-ipv6-gw.yml","contents":{"compression":"","source":"data:;base64,cm91dGVzOgogIGNvbmZpZzoKICAtIGRlc3RpbmF0aW9uOiA6Oi8wCiAgICBuZXh0LWhvcC1pbnRlcmZhY2U6IGVucDFzMAogICAgbmV4dC1ob3AtYWRkcmVzczogZDdiOjZiNGQ6N2IyNTpkMjJmOjoxCg=="}}]}}'
    name: cloudinitdisk
2. Create a Service to access the VM's TCP server.
3. Create a TCP connection to the TCP server.
4. Live-migrate the VM.
5. Send traffic over the opened TCP connection.
6. Go to step 4.
Actual results:
After some iterations, the TCP connection is broken.
Expected results:
The TCP connection should not be broken.
Additional info:
Looking at the KubeVirt live migration code, pod.Conditions.Ready is not checked, so the migration continues and the source pod is completed: https://github.com/kubevirt/kubevirt/blob/657665ce8a0175622326b0aa50fb4635bb8b637c/pkg/virt-controller/watch/vmi.go#L1101 KubeVirt should also check the pod Ready condition.
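A minimal sketch of the suggested check, assuming a hypothetical helper `targetPodUsable` in the migration flow; it is not KubeVirt's code. `PodCondition` is a local stand-in for the `Type`/`Status` pair of a core/v1 pod condition. The helper treats the target pod as usable only when it is both `Running` and `Ready`, and the example feeds it the target pod conditions quoted in the description.

```go
package main

import "fmt"

// PodCondition mirrors the Type/Status pair of a Kubernetes pod condition;
// it is a local stand-in so the sketch has no dependencies.
type PodCondition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// podReady reports whether the pod's Ready condition has status "True".
func podReady(conds []PodCondition) bool {
	for _, c := range conds {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

// targetPodUsable is a hypothetical gate: the target pod counts as usable
// only when it is Running and Ready, not merely Running.
func targetPodUsable(phase string, conds []PodCondition) bool {
	return phase == "Running" && podReady(conds)
}

func main() {
	// Target pod conditions from the status quoted in the description.
	conds := []PodCondition{
		{"Initialized", "True"},
		{"Ready", "False"},
		{"ContainersReady", "True"},
		{"PodScheduled", "True"},
		{"kubevirt.io/virtual-machine-unpaused", "True"},
	}
	fmt.Println(targetPodUsable("Running", conds)) // false
}
```

Checking Ready rather than the phase alone would cover the window in which the readiness gate condition is already True but the kubelet has not yet reflected it in Ready.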