Fast Datapath Product / FDP-2962

QE verification: Intermittent connectivity issues with multiple virtual machines across various nodes in their OpenShift Container Virtualization (OCV) environment.

    • Type: Task
    • Resolution: Unresolved
    • ovs-dpdk


      ( ) The bug has been reproduced and verified by QE members
      ( ) Test coverage has been added to downstream CI
      ( ) For new features, failed test plans have bugs added as children to the epic
      ( ) The bug is cloned to any relevant release that we support and/or is needed
    • rhel-9

      This ticket is tracking the QE verification effort for the solution to the problem described below.
      Description of problem: 

      The customer is experiencing intermittent connectivity issues with multiple virtual machines; one Windows VM shows the behavior consistently.
      The issue has been observed since the cluster was upgraded from 4.17 to 4.18.18.

      The issue is characterized by frequent disconnections, which sometimes resolve after migrating the VM to another node, after restarting that node's ovnkube-node pod, or on their own.

      The issue is currently visible on a Windows VM; in the past the behavior was also observed on Linux VMs spread across different nodes of the cluster.

       

      How the VMs are connected to the physical interface:

      bond1 --> enbd-ex (OVS bridge) --> localnet NAD
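This attachment can be cross-checked on each node with standard OVS commands (a sketch; the bridge name enbd-ex comes from this ticket, and the exact mapping value is cluster-specific):

```shell
# Confirm bond1 is attached to the OVS bridge backing the localnet NAD,
# and that ovn-bridge-mappings ties the NAD's physical network to that bridge.
ovs-vsctl list-ports enbd-ex
ovs-vsctl get Open_vSwitch . external-ids:ovn-bridge-mappings
```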

       

      From recent troubleshooting, we narrowed the problem down to two VMs in the same VLAN running on two different nodes:

      Source:

      Node: lben203vpm017u
      VM: lvenaacaac601u (10.119.134.31)
      VLAN: 2434

       

      # ovn-nbctl show nad.2434_ovn_localnet_switch
      switch ea5a5b0f-4697-4450-997d-a6d197e3a291 (nad.2434_ovn_localnet_switch)
          port aac.574.nad.2434_aac-574_virt-launcher-lvenaacaac601u-ct74v
              addresses: ["00:50:56:97:79:64"]
          port cxb.9.nad.2434_cxb-9_virt-launcher-lvencxbapp601u-jpgp2
              addresses: ["02:8e:62:00:00:9f"]
          port nad.2434_ovn_localnet_port
              type: localnet
              tag: 2434
              addresses: ["unknown"]
          port adc.31.nad.2434_adc-31_virt-launcher-wvenadcadc403u-hv69w
              addresses: ["00:50:56:97:67:b2"]
      

       

      A packet capture on the source node (lben203vpm017u) confirms that the ARP request from the source VM 10.119.134.31 never exits through the physical interface (bond1/enbd-ex). As a result, the destination node lben213vpm007u never receives the packet and, as expected, the destination VM never sees it.
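One way to confirm the symptom on the source node is to capture on the physical side only, where the frame should already carry the VLAN tag (a sketch; the interface and VLAN values are taken from this ticket):

```shell
# On a healthy localnet path the ARP request should appear on bond1 tagged
# with VLAN 2434; an empty capture here matches the reported behavior.
tcpdump -nei bond1 'vlan 2434 and arp and host 10.119.134.31'
```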

       

      # tcpdump -i any host 10.119.134.58 
      tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
      listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
      09:35:14.751059 0c25ea895989a_3 B   ARP, Request who-has 10.119.134.58 tell 10.119.134.31, length 28
      09:35:14.751076 1c9e4579fcddb_3 Out ARP, Request who-has 10.119.134.58 tell 10.119.134.31, length 28
      09:35:14.751080 6c6fd1c37c900_3 Out ARP, Request who-has 10.119.134.58 tell 10.119.134.31, length 28
      09:35:15.775934 0c25ea895989a_3 B   ARP, Request who-has 10.119.134.58 tell 10.119.134.31, length 28
      09:35:15.775951 1c9e4579fcddb_3 Out ARP, Request who-has 10.119.134.58 tell 10.119.134.31, length 28
      09:35:15.775954 6c6fd1c37c900_3 Out ARP, Request who-has 10.119.134.58 tell 10.119.134.31, length 28
      09:35:16.798941 0c25ea895989a_3 B   ARP, Request who-has 10.119.134.58 tell 10.119.134.31, length 28
      09:35:16.798945 1c9e4579fcddb_3 Out ARP, Request who-has 10.119.134.58 tell 10.119.134.31, length 28
      09:35:16.798947 6c6fd1c37c900_3 Out ARP, Request who-has 10.119.134.58 tell 10.119.134.31, length 28
      09:35:17.823152 0c25ea895989a_3 B   ARP, Request who-has 10.119.134.58 tell 10.119.134.31, length 28
      09:35:17.823156 1c9e4579fcddb_3 Out ARP, Request who-has 10.119.134.58 tell 10.119.134.31, length 2
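As a quick sanity check over a saved capture like the one above, counting requests against replies confirms the ARP goes unanswered (a sketch; the sample lines are copied from this ticket's capture):

```shell
# Count ARP requests (who-has) vs replies (is-at) in tcpdump text output.
# On a node, feed it real data instead: tcpdump -r arp.pcap -nn arp
capture='09:35:14.751059 0c25ea895989a_3 B   ARP, Request who-has 10.119.134.58 tell 10.119.134.31, length 28
09:35:14.751076 1c9e4579fcddb_3 Out ARP, Request who-has 10.119.134.58 tell 10.119.134.31, length 28
09:35:15.775934 0c25ea895989a_3 B   ARP, Request who-has 10.119.134.58 tell 10.119.134.31, length 28'
requests=$(printf '%s\n' "$capture" | grep -c 'who-has')
replies=$(printf '%s\n' "$capture" | grep -c 'is-at')
echo "requests=$requests replies=$replies"   # requests=3 replies=0
```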

       

       

      Destination:

      Node: lben213vpm007u
      VM: wvenaacaac304u (10.119.134.58)
      VLAN: 2434

       

      # ovn-nbctl show nad.2434_ovn_localnet_switch
      switch b400ae4b-1810-4ea2-aff0-872cb5b5d164 (nad.2434_ovn_localnet_switch)
          port nad.2434_ovn_localnet_port
              type: localnet
              tag: 2434
              addresses: ["unknown"]
          port aac.574.nad.2434_aac-574_virt-launcher-wvenaacaac304u-mcwsz
              addresses: ["00:50:56:97:06:40"]
      

       

      When we ran a reverse ping from 10.119.134.58 (dst) to 10.119.134.31 (src), the behavior was the same: the ARP packet never reached the physical interface bond1/enbd-ex.

      [root@lben213vpm007u /]# tcpdump -i any host 10.119.134.31 -nnn
      tcpdump: data link type LINUX_SLL2
      dropped privs to tcpdump
      tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
      listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
      11:24:17.023481 a826ad88d47d6_3 B   ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28
      11:24:17.025672 12b1dfc23c7ec_3 Out ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28
      11:24:18.012001 a826ad88d47d6_3 B   ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28
      11:24:18.012009 12b1dfc23c7ec_3 Out ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28
      11:24:19.012978 a826ad88d47d6_3 B   ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28
      11:24:19.012987 12b1dfc23c7ec_3 Out ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28
      11:24:20.014509 a826ad88d47d6_3 B   ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28
      11:24:20.014517 12b1dfc23c7ec_3 Out ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28
      11:24:21.020162 a826ad88d47d6_3 B   ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28
      11:24:21.020181 12b1dfc23c7ec_3 Out ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28
      11:24:22.020808 a826ad88d47d6_3 B   ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28
      11:24:22.020816 12b1dfc23c7ec_3 Out ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28
      11:24:23.022912 a826ad88d47d6_3 B   ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28
      11:24:23.022932 12b1dfc23c7ec_3 Out ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28
      11:24:24.010992 a826ad88d47d6_3 B   ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28
      11:24:24.010999 12b1dfc23c7ec_3 Out ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28
      11:24:25.011607 a826ad88d47d6_3 B   ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28
      11:24:25.011615 12b1dfc23c7ec_3 Out ARP, Request who-has 10.119.134.31 tell 10.119.134.58, length 28

      When we migrated another VM on VLAN 2434 to the lben213vpm007u node, ping worked from that VM to 10.119.134.58.

       

      ------------------ same node 58 to 98 --------------
      [root@lben213vpm007u /]# tcpdump -i any host 10.119.134.98 and host 10.119.134.58 -nnn
      tcpdump: data link type LINUX_SLL2
      dropped privs to tcpdump
      tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
      listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
      11:27:16.710204 a826ad88d47d6_3 P   IP 10.119.134.58 > 10.119.134.98: ICMP echo request, id 1, seq 57, length 40
      11:27:16.710559 12b1dfc23c7ec_3 Out IP 10.119.134.58 > 10.119.134.98: ICMP echo request, id 1, seq 57, length 40
      11:27:16.710711 12b1dfc23c7ec_3 P   IP 10.119.134.98 > 10.119.134.58: ICMP echo reply, id 1, seq 57, length 40
      11:27:16.711012 a826ad88d47d6_3 Out IP 10.119.134.98 > 10.119.134.58: ICMP echo reply, id 1, seq 57, length 40
      11:27:17.714190 a826ad88d47d6_3 P   IP 10.119.134.58 > 10.119.134.98: ICMP echo request, id 1, seq 58, length 40
      11:27:17.714202 12b1dfc23c7ec_3 Out IP 10.119.134.58 > 10.119.134.98: ICMP echo request, id 1, seq 58, length 40
      11:27:17.714335 12b1dfc23c7ec_3 P   IP 10.119.134.98 > 10.119.134.58: ICMP echo reply, id 1, seq 58, length 40
      11:27:17.714338 a826ad88d47d6_3 Out IP 10.119.134.98 > 10.119.134.58: ICMP echo reply, id 1, seq 58, length 40
      11:27:18.732977 a826ad88d47d6_3 P   IP 10.119.134.58 > 10.119.134.98: ICMP echo request, id 1, seq 59, length 40
      11:27:18.732985 12b1dfc23c7ec_3 Out IP 10.119.134.58 > 10.119.134.98: ICMP echo request, id 1, seq 59, length 40
      11:27:18.733194 12b1dfc23c7ec_3 P   IP 10.119.134.98 > 10.119.134.58: ICMP echo reply, id 1, seq 59, length 40
      11:27:18.733198 a826ad88d47d6_3 Out IP 10.119.134.98 > 10.119.134.58: ICMP echo reply, id 1, seq 59, length 40
      11:27:19.746997 a826ad88d47d6_3 P   IP 10.119.134.58 > 10.119.134.98: ICMP echo request, id 1, seq 60, length 40
      11:27:19.747006 12b1dfc23c7ec_3 Out IP 10.119.134.58 > 10.119.134.98: ICMP echo request, id 1, seq 60, length 40
      11:27:19.747186 12b1dfc23c7ec_3 P   IP 10.119.134.98 > 10.119.134.58: ICMP echo reply, id 1, seq 60, length 40
      11:27:19.747192 a826ad88d47d6_3 Out IP 10.119.134.98 > 10.119.134.58: ICMP echo reply, id 1, seq 60, length 40
      11:27:21.516034 a826ad88d47d6_3 P   ARP, Request who-has 10.119.134.98 (02:8e:62:00:01:42) tell 10.119.134.58, length 28
      11:27:21.518149 12b1dfc23c7ec_3 Out ARP, Request who-has 10.119.134.98 (02:8e:62:00:01:42) tell 10.119.134.58, length 28
      11:27:21.518250 12b1dfc23c7ec_3 P   ARP, Reply 10.119.134.98 is-at 02:8e:62:00:01:42, length 28
      11:27:21.518582 a826ad88d47d6_3 Out ARP, Reply 10.119.134.98 is-at 02:8e:62:00:01:42, length 28
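To pinpoint where the source node drops the broadcast, the failing ARP request can be replayed through the OpenFlow pipeline with `ovs-appctl ofproto/trace` (a sketch; the in_port placeholder must be filled in with the VM interface's ofport on br-int, and the MAC/IP values are the source VM's from this ticket):

```shell
# Look up the VM interface's ofport first, e.g.:
#   ovs-vsctl --columns=name,ofport list Interface | grep -A1 virt-launcher
# Then replay the failing ARP request through br-int:
ovs-appctl ofproto/trace br-int \
  'in_port=<VM_OFPORT>,dl_src=00:50:56:97:79:64,dl_dst=ff:ff:ff:ff:ff:ff,arp,arp_op=1,arp_spa=10.119.134.31,arp_tpa=10.119.134.58'
```

The trace output shows each flow table the packet traverses and the final action, which should reveal whether the drop happens before or at the localnet patch port.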
      

       

      Version-Release number of selected component (if applicable):

      4.18.8

      Actual results:

      • Packet loss is observed in the ping command run from the bastion.
      • PING fails from 

      Expected results:

      • Ping shouldn't be failing.

       

      Additional info:

      • Old must-gather - 0110-must-gather.local.9025741996085704360.tar.gz
      • Raw troubleshooting text file - 0340-issue_vm.txt
      • Source: Packet capture, OVN DB, OVS DB - 0270-source-134.31.tar.gz
      • Destination: Packet capture, OVN DB, OVS DB - 0280-dest-134.58.tar.gz
      • SOS of source node - 0300-sosreport-lben203vpm017u-2026-01-07-oafvpri.tar.xz
      • OVS commands from source: 0330-lben203vpm017u-ovs_command_logs.tar.gz
      • OVS commands from destination: 0320-lben213vpm007u-ovs_command_logs.tar.gz
      • Customer has yet to upload the latest must-gather

              ovn-qe OVN QE
              rhn-support-pkhedeka Parikshit Khedekar