OpenShift Virtualization / CNV-38604

kubevirt TCP connection broken after successful live migration


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Undefined
    • Component: CNV Network
    • Quality / Stability / Reliability
    • 0.42

      Description of problem:

       

      A TCP connection opened towards a VM is sometimes broken after live migration when using RHCOS images with ovn-kubernetes and bridge binding.
      
      Looking at the EndpointSlices of the Service after live migration, we see the following transition:
      
      old:
      - addresses:
        - 10.244.1.7
        conditions:
          ready: true
          serving: true
          terminating: false
        nodeName: ovn-worker
        targetRef:
          kind: Pod
          name: virt-launcher-worker1-ndqjc
          namespace: kv-live-migration-1994
          uid: 73606e39-4b86-4af4-a072-84ad308cf490
      - addresses:
        - 10.244.1.7
        conditions:
          ready: false
          serving: false
          terminating: false
        nodeName: ovn-worker2
        targetRef:
          kind: Pod
          name: virt-launcher-worker1-bpm95
          namespace: kv-live-migration-1994
          uid: e8e2aaa1-5814-406d-9b48-1327398a4b5c
      
      new:
      - addresses:
        - 10.244.1.7
        conditions:
          ready: false
          serving: false
          terminating: false
        nodeName: ovn-worker
        targetRef:
          kind: Pod
          name: virt-launcher-worker1-ndqjc
          namespace: kv-live-migration-1994
          uid: 73606e39-4b86-4af4-a072-84ad308cf490
      - addresses:
        - 10.244.1.7
        conditions:
          ready: false
          serving: false
          terminating: false
        nodeName: ovn-worker2
        targetRef:
          kind: Pod
          name: virt-launcher-worker1-bpm95
          namespace: kv-live-migration-1994
          uid: e8e2aaa1-5814-406d-9b48-1327398a4b5c
      Since ovn-kubernetes detects that neither endpoint is ready, it removes the OVN/OVS load-balancer configuration for the service, which breaks the established connection.
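Why two not-ready endpoints tear the connection down can be sketched with the endpoint-selection rule service proxies apply. This is a minimal, self-contained Go sketch using simplified local types (not the real discovery.k8s.io/v1 API); the fall-back-to-serving behavior follows the Kubernetes terminating-endpoints rule, and ovn-kubernetes is assumed here to behave equivalently:

```go
package main

import "fmt"

// EndpointConditions mirrors the fields shown in the EndpointSlices above
// (simplified local type, not the real discovery.k8s.io/v1 API).
type EndpointConditions struct {
	Ready       bool
	Serving     bool
	Terminating bool
}

// usableEndpoints sketches the proxy selection rule: prefer ready endpoints;
// if none are ready, fall back to endpoints that are still serving while
// terminating. If both sets are empty, the service has no backends and the
// proxy removes its load-balancer entries, dropping existing connections.
func usableEndpoints(eps []EndpointConditions) []EndpointConditions {
	var ready, fallback []EndpointConditions
	for _, ep := range eps {
		if ep.Ready {
			ready = append(ready, ep)
		} else if ep.Serving && ep.Terminating {
			fallback = append(fallback, ep)
		}
	}
	if len(ready) > 0 {
		return ready
	}
	return fallback
}

func main() {
	// The "new" EndpointSlice state after migration: both endpoints report
	// ready:false, serving:false, terminating:false.
	after := []EndpointConditions{
		{Ready: false, Serving: false, Terminating: false}, // source virt-launcher
		{Ready: false, Serving: false, Terminating: false}, // target virt-launcher
	}
	fmt.Println(len(usableEndpoints(after)))
}
```

With both endpoints in the post-migration state, neither the preferred set nor the fallback set has any members, which is exactly the situation in which the network plumbing is removed.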
      
      Looking at the status of the target virt-launcher pod:
      
      target pod:
        status:
          phase: Running
          conditions:
          - lastProbeTime: null 
            lastTransitionTime: "2024-02-21T09:00:02Z"
            status: "True"
            type: Initialized
          - lastProbeTime: null
            lastTransitionTime: "2024-02-21T08:59:40Z"
            message: corresponding condition of pod readiness gate "kubevirt.io/virtual-machine-unpaused"
              does not exist.
            reason: ReadinessGatesNotReady
            status: "False"
            type: Ready
          - lastProbeTime: null
            lastTransitionTime: "2024-02-21T09:00:09Z"
            status: "True"
            type: ContainersReady
          - lastProbeTime: null
            lastTransitionTime: "2024-02-21T08:59:40Z"
            status: "True"
            type: PodScheduled
          - lastProbeTime: "2024-02-21T09:00:29Z"
            lastTransitionTime: "2024-02-21T09:00:29Z"
            message: the virtual machine is not paused
            reason: NotPaused
            status: "True"
            type: kubevirt.io/virtual-machine-unpaused                                
      
      We see that the readiness-gate mechanism still needs to catch up and post the gate condition, yet the pod phase is already Running.
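The interaction between phase and readiness gates can be illustrated with a sketch of how the Ready condition is derived. These are simplified local types standing in for corev1; the rule shown (ContainersReady True plus every condition named in spec.readinessGates present with status True) matches the documented pod readiness-gate behavior:

```go
package main

import "fmt"

// Condition is a simplified stand-in for corev1.PodCondition.
type Condition struct {
	Type   string
	Status string
}

// podReady sketches the kubelet rule: the pod is Ready only when
// ContainersReady is True AND every readiness-gate condition exists with
// status True. Right after migration the gate condition
// "kubevirt.io/virtual-machine-unpaused" does not exist yet, so Ready stays
// False even though the pod phase is already Running.
func podReady(conditions []Condition, readinessGates []string) bool {
	byType := map[string]string{}
	for _, c := range conditions {
		byType[c.Type] = c.Status
	}
	if byType["ContainersReady"] != "True" {
		return false
	}
	for _, gate := range readinessGates {
		if byType[gate] != "True" { // a missing gate condition also blocks Ready
			return false
		}
	}
	return true
}

func main() {
	gates := []string{"kubevirt.io/virtual-machine-unpaused"}
	// Conditions as observed before the gate condition was posted:
	before := []Condition{{Type: "ContainersReady", Status: "True"}}
	// After virt-controller posts the gate condition:
	after := append(before, Condition{Type: "kubevirt.io/virtual-machine-unpaused", Status: "True"})
	fmt.Println(podReady(before, gates), podReady(after, gates))
}
```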
      
      
      
      

      Version-Release number of selected component (if applicable):

      How reproducible:

      3/12 live migrations

      Steps to Reproduce:

      1. Create the following VM with a TCP server running on it:
      
      apiVersion: kubevirt.io/v1
      kind: VirtualMachineInstance
      metadata:
        name: worker2
        annotations:
          kubevirt.io/allow-pod-bridge-network-live-migration: ""
      spec:
        architecture: amd64
        domain:
          devices:
            disks:
            - disk:
                bus: virtio
              name: containerdisk
            - disk:
                bus: virtio
              name: cloudinitdisk
            interfaces:
            - bridge: {}
              name: pod
            rng: {}
          machine:
            type: q35
          resources:
            requests:
              memory: 512Mi
        networks:
        - pod: {}
          name: pod
        nodeSelector:
          hypershift: "true"
          node-role.kubernetes.io/worker: ""
        terminationGracePeriodSeconds: 5
        volumes:
        - containerDisk:
            image: quay.io/fedora/fedora-coreos-kubevirt:stable
          name: containerdisk
        - cloudInitConfigDrive:
            userData: '{"ignition":{"version":"3.3.0"},"passwd":{"users":[{"name":"core","passwordHash":"$y$j9T$b7RFf2LW7MUOiF4RyLHKA0$T.Ap/uzmg8zrTcUNXyXvBvT26UgkC6zZUVg3UKXeEp5"}]},"storage":{"files":[{"path":"/etc/nmstate/001-dual-stack-dhcp.yml","contents":{"compression":"gzip","source":"data:;base64,H4sIAAAAAAAC/4zKQQrCMBCF4f2c4l1AUBAXc5sxfaGBOh2SScHbiy5cd/n9/M2TvVrhULnA7UUFPW7jKkC+48tc2Z0pwEhLKmYI0OK4qwAA3Z4bF0X2yV9Z1hJ/tjgep0bAZu5l96qotg3KJwAA//+PTU/JngAAAA=="}},{"path":"/etc/nmstate/002-dual-sack-ipv6-gw.yml","contents":{"compression":"","source":"data:;base64,cm91dGVzOgogIGNvbmZpZzoKICAtIGRlc3RpbmF0aW9uOiA6Oi8wCiAgICBuZXh0LWhvcC1pbnRlcmZhY2U6IGVucDFzMAogICAgbmV4dC1ob3AtYWRkcmVzczogZDdiOjZiNGQ6N2IyNTpkMjJmOjoxCg=="}}]}}'
          name: cloudinitdisk
      
      2. Create a Service to access the VM's TCP server.
      3. Open a TCP connection to the TCP server.
      4. Live migrate the VM.
      5. Send traffic over the open TCP connection.
      6. Repeat from step 4.
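For step 2, a Service of the following shape can be used. This is an assumed example manifest: the port number is arbitrary, and the selector relies on the vm.kubevirt.io/name label that KubeVirt places on virt-launcher pods:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: worker2-tcp
spec:
  selector:
    vm.kubevirt.io/name: worker2   # label set by KubeVirt on the virt-launcher pod
  ports:
  - protocol: TCP
    port: 8080          # assumed port of the in-VM TCP server
    targetPort: 8080
```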

      Actual results:

      After some iterations the TCP connection is broken.

      Expected results:

      The TCP connection should survive live migration.

      Additional info:

      Checking the KubeVirt live-migration code, the target pod's Ready condition is not checked, so the migration continues and the source pod is completed:
      
      https://github.com/kubevirt/kubevirt/blob/657665ce8a0175622326b0aa50fb4635bb8b637c/pkg/virt-controller/watch/vmi.go#L1101
      
      KubeVirt should also check the pod Ready condition.
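A sketch of the proposed check, with simplified local types standing in for corev1. isPodReady follows the usual Kubernetes pod-ready helper semantics, and targetPodUsable is a hypothetical gate the migration controller could apply before treating the target pod as a valid handoff destination:

```go
package main

import "fmt"

// Condition and Pod are simplified stand-ins for the corev1 types.
type Condition struct {
	Type   string
	Status string
}

type Pod struct {
	Phase      string
	Conditions []Condition
}

// isPodReady returns true only if the Ready condition exists with status
// "True", mirroring the standard Kubernetes pod-ready helper.
func isPodReady(p Pod) bool {
	for _, c := range p.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

// targetPodUsable is the proposed check: do not accept the target
// virt-launcher pod on phase alone; require the Ready condition too, so the
// readiness gates have been satisfied before the source pod is completed.
func targetPodUsable(p Pod) bool {
	return p.Phase == "Running" && isPodReady(p)
}

func main() {
	// Target pod state from this report: Running, but Ready is False because
	// the readiness-gate condition had not been posted yet.
	target := Pod{
		Phase: "Running",
		Conditions: []Condition{
			{Type: "ContainersReady", Status: "True"},
			{Type: "Ready", Status: "False"},
		},
	}
	fmt.Println(targetPodUsable(target))
}
```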

       

              ellorent Felix Enrique Llorente Pastora
              Nir Rozen Nir Rozen (Inactive)