OpenShift Virtualization / CNV-29499

[2213262] Lost connectivity after live migration of a VM with a hot-plugged disk



      Version-Release number of selected component (if applicable):
      4.13.0

      How reproducible:
      Every time

      Expected results:
      Able to live migrate to any OCP worker node without losing guest network connectivity

      Additional Info:
      OCP 4.13.0 BareMetal
      Worker/Infra nodes have 4 links in a bond configuration
      OCP Virt 4.13.0
      ODF 4.12
      nncp & nad: cnv-bridge

      The issue is as follows:

      • Created a new VM (rhel8) with a dual network stack (default net + bridge net)
      • Single rootdisk as part of the initial creation (backed by ODF Ceph RBD)
      • Started the VM
      • Started pinging the guest VM from a utility host
      • Live migrated the guest VM between the 3 workers; I did this several times to make sure it landed on every worker node in the cluster (commands sketched below)
      • Pinging keeps working (it drops maybe 1-2 packets while the guest VM moves between worker nodes, then resumes normally)
      • Networking inside the guest VM works: it can ping other hosts on the network as well as the gateway
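
      For reference, the steps above were driven with commands roughly like the following (VM name and namespace taken from the VMI dump further down; the ping target is the guest's bridge-network address):

      # Start the VM and wait for it to become ready
      virtctl start dvuulvocpvmi02 -n test-vmis
      oc get vmi dvuulvocpvmi02 -n test-vmis -w

      # From the utility host, ping the guest's bridge-network IP
      ping 10.176.192.151

      # Trigger a live migration and watch the migration object
      virtctl migrate dvuulvocpvmi02 -n test-vmis
      oc get vmim -n test-vmis -w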

      Here is where I run into issues:

      • Hot-added a disk to the guest VM (blank PVC from the ODF Ceph RBD storage class; see the CLI equivalent after this list)
      • Verified the disk was added via the console
      • (pinging still works)
      • Initiated a live migration
      • Waited for the guest VM to finish migrating
      • (pinging stops getting responses)
      • Logged into the guest console and checked a few things (ip route, ip neigh, etc.)
      • Issued systemctl restart NetworkManager in the guest VM; although this succeeds, I still can't ping anything, not other hosts on the same bridge network, not even the gateway
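
      For reference, the same hot-add can be done from the CLI, roughly as follows (DataVolume/volume names taken from the VMI status further down):

      # Hot-plug the blank disk into the running VM
      virtctl addvolume dvuulvocpvmi02 -n test-vmis \
        --volume-name=dvuulvocpvmi02-disk-little-walrus --persist

      # Confirm the hotplug attach pod and the volume status
      oc get pods -n test-vmis | grep hp-volume
      oc get vmi dvuulvocpvmi02 -n test-vmis -o yaml | grep -A5 volumeStatus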

      For pinging (or the guest VM network in general) to resume, I have to trigger another live migration and hope the guest VM lands on the original worker node where I hot-added the disk. This is the interesting part: why does the network resume only when the guest VM lands back on the worker node it was on when I hot-plugged the disk?

      I verified this with other test VMs: I noted the worker node, tested the network, added one additional disk, and then live migrated. The network stays broken until the guest VM is back on the worker node where I added the disk.
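
      To narrow this down further, the next things I'd compare (a sketch; MAC and interface names as in the VMI dump below) are whether the guest's bridge MAC gets learned on br1 on the destination node, and whether a gratuitous ARP from the guest changes anything:

      # On the destination worker node (oc debug node/<node>, then chroot /host):
      bridge fdb show br br1 | grep -i 02:bb:06:00:00:07
      bridge vlan show dev bond0

      # Inside the guest, on the bridge-attached NIC:
      ip neigh flush dev enp2s0
      arping -c 3 -U -I enp2s0 10.176.192.151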

      NNCP Config:
      apiVersion: nmstate.io/v1
      kind: NodeNetworkConfigurationPolicy
      metadata:
        name: br1-bond0-policy
      spec:
        desiredState:
          interfaces:
            - name: br1
              description: Bond0 br1
              type: linux-bridge
              state: up
              ipv4:
                enabled: false
              bridge:
                options:
                  stp:
                    enabled: false
                port:
                  - name: bond0
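
      The policy and its per-node enactments can be checked with, for example:

      oc get nncp br1-bond0-policy
      oc get nnce | grep br1-bond0-policy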

      NAD Config:
      apiVersion: k8s.cni.cncf.io/v1
      kind: NetworkAttachmentDefinition
      metadata:
        annotations:
          description: Hypervisor
          k8s.v1.cni.cncf.io/resourceName: bridge.network.kubevirt.io/br1
        generation: 2
        name: br1-vlan192
        namespace: test-vmis
      spec:
        config: >-
          {"name":"br1-vlan192","type":"cnv-bridge","cniVersion":"0.3.1","bridge":"br1","vlan":192,"macspoofchk":true,"ipam":{},"preserveDefaultVlan": false}

      NetworkManager connection profiles from the worker nodes (this one from wrk1):
      [connection]
      id=bond0.104
      uuid=912add91-19a5-4ac1-9f6c-1f137453dddd
      type=vlan
      interface-name=bond0.104
      autoconnect=true
      autoconnect-priority=1

      [ethernet]

      [vlan]
      flags=1
      id=104
      parent=208a8ef4-8a95-4425-b4ad-58c7431614b9

      [ipv4]
      address1=10.176.104.170/22
      dhcp-client-id=mac
      dns=209.196.203.128;
      dns-priority=40
      dns-search=corp.CLIENTNAME.com;
      method=manual
      route1=0.0.0.0/0,10.176.107.254
      route1_options=table=254

      [ipv6]
      addr-gen-mode=eui64
      dhcp-duid=ll
      dhcp-iaid=mac
      method=disabled

      [proxy]

      ----------------------------------------------------
      [connection]
      id=bond0
      uuid=208a8ef4-8a95-4425-b4ad-58c7431614b9
      type=bond
      autoconnect-priority=1
      autoconnect-slaves=1
      interface-name=bond0
      master=eda12b69-4e74-47b2-b7bf-6497855f226e
      slave-type=bridge
      timestamp=1685497889

      [ethernet]
      cloned-mac-address=3C:EC:EF:74:4D:80

      [bond]
      miimon=100
      mode=802.3ad

      [bridge-port]
      vlans=2-4094

      ----------------------------------------------------
      [connection]
      id=br1
      uuid=eda12b69-4e74-47b2-b7bf-6497855f226e
      type=bridge
      autoconnect-slaves=1
      interface-name=br1
      timestamp=1685726538

      [ethernet]

      [bridge]
      stp=false
      vlan-filtering=true

      [ipv4]
      method=disabled

      [ipv6]
      addr-gen-mode=default
      method=disabled

      [proxy]

      [user]
      nmstate.interface.description=Bond0 br1

      ----------------------------------------------------
      [connection]
      id=eno1np0
      uuid=e469b9bd-c767-4819-80b2-5363f17ba870
      type=ethernet
      interface-name=eno1np0
      master=208a8ef4-8a95-4425-b4ad-58c7431614b9
      slave-type=bond
      autoconnect=true
      autoconnect-priority=1

      ----------------------------------------------------
      [connection]
      id=eno2np1
      uuid=9d5a4724-54d7-4851-bd45-5262a7990908
      type=ethernet
      interface-name=eno2np1
      master=208a8ef4-8a95-4425-b4ad-58c7431614b9
      slave-type=bond
      autoconnect=true
      autoconnect-priority=1

      ----------------------------------------------------
      [connection]
      id=enp1s0f0
      uuid=333f032e-42a8-41b3-94aa-5872ddb647e4
      type=ethernet
      interface-name=enp1s0f0
      master=208a8ef4-8a95-4425-b4ad-58c7431614b9
      slave-type=bond
      autoconnect=true
      autoconnect-priority=1

      ----------------------------------------------------
      [connection]
      id=enp1s0f1
      uuid=3542e9b7-0bad-4e36-b734-0eff79071cac
      type=ethernet
      interface-name=enp1s0f1
      master=208a8ef4-8a95-4425-b4ad-58c7431614b9
      slave-type=bond
      autoconnect=true
      autoconnect-priority=1
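
      These profiles can be cross-checked directly on a worker, for example:

      # On the node (oc debug node/dvuuopwkr03, then chroot /host)
      nmcli connection show --active
      cat /proc/net/bonding/bond0
      ip -d link show br1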

      VMI:
      apiVersion: kubevirt.io/v1
      kind: VirtualMachineInstance
      metadata:
        annotations:
          kubevirt.io/latest-observed-api-version: v1
          kubevirt.io/storage-observed-api-version: v1alpha3
          vm.kubevirt.io/flavor: small
          vm.kubevirt.io/os: rhel8
          vm.kubevirt.io/workload: server
        creationTimestamp: "2023-06-06T15:42:49Z"
        finalizers:
          - kubevirt.io/virtualMachineControllerFinalize
          - foregroundDeleteVirtualMachine
        generation: 15
        labels:
          kubevirt.io/domain: dvuulvocpvmi02
          kubevirt.io/nodeName: dvuuopwkr03
          kubevirt.io/size: small
        name: dvuulvocpvmi02
        namespace: test-vmis
        ownerReferences:
          - apiVersion: kubevirt.io/v1
            blockOwnerDeletion: true
            controller: true
            kind: VirtualMachine
            name: dvuulvocpvmi02
            uid: 4c3b425a-4e4c-4533-9184-f0680cbf185d
        resourceVersion: "12322288"
        uid: 1811fa44-e509-431c-a937-3ad32e8d127f
      spec:
        domain:
          cpu:
            cores: 2
            model: host-model
            sockets: 1
            threads: 1
          devices:
            disks:
              - bootOrder: 2
                disk:
                  bus: virtio
                name: rootdisk
              - disk:
                  bus: scsi
                name: disk-little-walrus
            interfaces:
              - macAddress: 02:bb:06:00:00:06
                masquerade: {}
                model: virtio
                name: default
              - bridge: {}
                macAddress: 02:bb:06:00:00:07
                model: virtio
                name: nic-liable-mollusk
            networkInterfaceMultiqueue: true
            rng: {}
          features:
            acpi:
              enabled: true
          firmware:
            uuid: 5a07e466-7638-51a5-9fdd-8ab5e24aebe4
          machine:
            type: pc-q35-rhel9.2.0
          resources:
            requests:
              memory: 4Gi
        evictionStrategy: LiveMigrate
        networks:
          - name: default
            pod: {}
          - multus:
              networkName: test-vmis/br1-vlan192
            name: nic-liable-mollusk
        terminationGracePeriodSeconds: 180
        volumes:
          - dataVolume:
              name: dvuulvocpvmi02
            name: rootdisk
          - dataVolume:
              hotpluggable: true
              name: dvuulvocpvmi02-disk-little-walrus
            name: disk-little-walrus
      status:
        activePods:
          abd603fb-a0ff-4bd0-bf38-ac515cf49c83: dvuuopwkr03
        conditions:
          - lastProbeTime: null
            lastTransitionTime: "2023-06-06T15:43:10Z"
            status: "True"
            type: Ready
          - lastProbeTime: null
            lastTransitionTime: null
            status: "True"
            type: LiveMigratable
          - lastProbeTime: "2023-06-06T15:43:31Z"
            lastTransitionTime: null
            status: "True"
            type: AgentConnected
        guestOSInfo:
          id: rhel
          kernelRelease: 4.18.0-425.19.2.el8_7.x86_64
          kernelVersion: '#1 SMP Fri Mar 17 01:52:38 EDT 2023'
          name: Red Hat Enterprise Linux
          prettyName: Red Hat Enterprise Linux 8.7 (Ootpa)
          version: "8.7"
          versionId: "8.7"
        interfaces:
          - infoSource: domain, guest-agent
            interfaceName: enp1s0
            ipAddress: 192.168.6.231
            ipAddresses:
              - 192.168.6.231
            mac: 02:bb:06:00:00:06
            name: default
            queueCount: 2
          - infoSource: domain, guest-agent
            interfaceName: enp2s0
            ipAddress: 10.176.192.151
            ipAddresses:
              - 10.176.192.151
              - fe80::bb:6ff:fe00:7
            mac: 02:bb:06:00:00:07
            name: nic-liable-mollusk
            queueCount: 2
        launcherContainerImageVersion: registry.redhat.io/container-native-virtualization/virt-launcher-rhel9@sha256:8d493a50ff05c3b9f30d3ccdd93acec3b1d7fdc07324ce4b92521c6b084496b3
        migrationMethod: LiveMigration
        migrationTransport: Unix
        nodeName: dvuuopwkr03
        phase: Running
        phaseTransitionTimestamps:
          - phase: Pending
            phaseTransitionTimestamp: "2023-06-06T15:42:49Z"
          - phase: Scheduling
            phaseTransitionTimestamp: "2023-06-06T15:42:49Z"
          - phase: Scheduled
            phaseTransitionTimestamp: "2023-06-06T15:43:10Z"
          - phase: Running
            phaseTransitionTimestamp: "2023-06-06T15:43:13Z"
        qosClass: Burstable
        runtimeUser: 107
        selinuxContext: system_u:object_r:container_file_t:s0:c714,c978
        virtualMachineRevisionName: revision-start-vm-4c3b425a-4e4c-4533-9184-f0680cbf185d-17
        volumeStatus:
          - hotplugVolume:
              attachPodName: hp-volume-7vnjd
              attachPodUID: 843df1d6-191f-4581-a1e1-4ba9daa15a49
            message: Successfully attach hotplugged volume disk-little-walrus to VM
            name: disk-little-walrus
            persistentVolumeClaimInfo:
              accessModes:
                - ReadWriteMany
              capacity:
                storage: 5Gi
              filesystemOverhead: "0"
              requests:
                storage: "5368709120"
              volumeMode: Block
            phase: Ready
            reason: VolumeReady
            target: sda
          - name: rootdisk
            persistentVolumeClaimInfo:
              accessModes:
                - ReadWriteMany
              capacity:
                storage: 100Gi
              filesystemOverhead: "0"
              requests:
                storage: "107374182400"
              volumeMode: Block
            target: vda
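
      For the record, after a migration the placement of the virt-launcher pod and the hotplug attach pod (hp-volume-7vnjd in the status above) can be compared with:

      oc get pods -n test-vmis -o wide | grep -E 'virt-launcher|hp-volume'
      oc get vmi dvuulvocpvmi02 -n test-vmis -o jsonpath='{.status.nodeName}'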
