  1. Fast Datapath Product
  2. FDP-2885

Test Coverage: OVS bridge LOCAL port with a global MAC does not answer ARP requests

    • Type: Task
    • Resolution: Unresolved
    • ovs-dpdk
    • ( ) The test coverage is aligned with the epic's acceptance criteria
    • rhel-9

      This task is tracking the test case writing activities to cover the bug described below.

       Problem Description: Clearly explain the issue.

      With a host-level VLAN configuration, Open vSwitch (OVS) does not respond to ARP requests because the necessary OVS flows are missing. The unanswered ARP requests disrupt communication among OCP peers.

       

      Note that this happens when the NIC uses a global MAC, because that MAC gets cloned from the NIC to the bridge LOCAL port. When the NIC MAC is local, the OVS bridge LOCAL port has its own generated MAC address and the issue does not reproduce.
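A test case needs to tell the two kinds of MAC address apart. The following is a minimal sketch, assuming that "global" and "local" in the note above refer to the IEEE 802 universal/local (U/L) bit, which is bit 0x02 of the first octet. The helper name is hypothetical and is not part of OVS.

```python
def is_locally_administered(mac: str) -> bool:
    """Return True if the U/L bit (0x02 of the first octet) is set."""
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x02)


# MAC addresses taken from the traces in this ticket.
print(is_locally_administered("00:21:5a:9b:56:34"))  # br-sec MAC -> False (global)
print(is_locally_administered("02:cb:df:00:00:02"))  # VM MAC -> True (local)
```

Under that assumption, a test could use such a check to confirm that the NIC under test really carries a global MAC before trying to reproduce the failure.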

      Networking Architecture:

                                /-- bond0.280 (vlan) --> br-ex (ovs-bridge - node IP)
      eno1,eno2 --> bond0 -->
                                \-- bond0.282 (vlan) --> br-sec (ovs-bridge - used for VM networking in VLAN 282 with the help of a localnet NAD)

       

      This is a 3-node compact OCP bare-metal cluster. Each node has two physical NICs that form an active/backup bond (bond0), and two VLAN interfaces are created on top of bond0. The OVS bridge br-ex is connected to bond0.280 and the OVS bridge br-sec is connected to bond0.282. br-ex is on the 10.92.80.0/20 network and br-sec is on 10.92.112.0/20. The three OCP nodes use the IP addresses 10.92.119.{71-73} on their br-sec interfaces.
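When a test bed mirrors this addressing plan, it can help to sanity-check the fixture data first. The sketch below uses only the subnets and addresses quoted above; the constant and function names are hypothetical.

```python
import ipaddress

BR_EX_NET = ipaddress.ip_network("10.92.80.0/20")
BR_SEC_NET = ipaddress.ip_network("10.92.112.0/20")

# br-sec addresses of the three nodes, 10.92.119.71 through 10.92.119.73.
BR_SEC_NODE_IPS = [ipaddress.ip_address(f"10.92.119.{host}") for host in range(71, 74)]


def check_addressing() -> bool:
    """True if every node IP is inside the br-sec subnet and the two subnets are disjoint."""
    all_in_subnet = all(ip in BR_SEC_NET for ip in BR_SEC_NODE_IPS)
    return all_in_subnet and not BR_EX_NET.overlaps(BR_SEC_NET)


print(check_addressing())
```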

      Issue:

      Intermittently, node-to-node communication over br-sec fails because a peer does not answer ARP requests. Packet captures confirm that the ARP request is received on the destination peer, but OVS does not send a response because of missing OVS flows. In the test case, a ping from 10.92.119.71 to 10.92.119.72 does not work because ARP resolution for the destination IP 10.92.119.72 fails.

       

      • In the packet capture below, the ARP request for 10.92.119.72 is seen on the bond0.282 interface, which is attached to the br-sec bridge as a bridge port and receives the traffic at the OVS layer.
      • The ARP request is not forwarded to the br-sec interface; instead, it is forwarded to the VM interfaces that are connected to the br-sec bridge.
      # tcpdump -B 20480 -s500 -i any arp -nnn | egrep "10.92.119.71|10.92.119.72"
      11:26:48.161277 bond0.282 B   ARP, Request who-has 10.92.119.72 tell 10.92.119.71, length 46
      11:26:48.161635 97f312f88fcb1_4 Out ARP, Request who-has 10.92.119.72 tell 10.92.119.71, length 46
      11:26:48.161637 c94fef0483e5c_4 Out ARP, Request who-has 10.92.119.72 tell 10.92.119.71, length 46
      11:26:48.161638 e8f1a02cb3c41_4 Out ARP, Request who-has 10.92.119.72 tell 10.92.119.71, length 46
       
      
      • As expected, ARP resolution fails.

       

      In the non-working case, ofproto/trace to the br-sec MAC ends in "Datapath actions: drop":

       

      [root@slabnode2160 Ramesh-RH]# ovs-appctl ofproto/trace br-sec in_port=1,dl_dst=00:21:5a:9b:56:34
      Flow: in_port=1,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=00:21:5a:9b:56:34,dl_type=0x0000
      bridge("br-sec")
      ----------------
       0. priority 0
          NORMAL
           >>>> received packet on unknown port 1 <<<<
           >> no input bundle, dropping
      Final flow: unchanged
      
      Megaflow: recirc_id=0,eth,in_port=1,dl_src=00:00:00:00:00:00,dl_dst=00:21:5a:9b:56:34,dl_type=0x0000
      Datapath actions: drop
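An automated test could assert on the last line of the trace instead of reading it by eye. The following is a minimal sketch that extracts the "Datapath actions" value from saved ofproto/trace output; the function name is hypothetical and the sample text is abbreviated from the trace above.

```python
import re


def datapath_actions(trace_output: str):
    """Return the text after 'Datapath actions:' in ofproto/trace output, or None."""
    match = re.search(r"^\s*Datapath actions:\s*(.+?)\s*$", trace_output, re.MULTILINE)
    return match.group(1) if match else None


# Tail of the non-working trace shown in this ticket.
NON_WORKING_TRACE = """\
Final flow: unchanged

Megaflow: recirc_id=0,eth,in_port=1,dl_src=00:00:00:00:00:00,dl_dst=00:21:5a:9b:56:34,dl_type=0x0000
Datapath actions: drop
"""

print(datapath_actions(NON_WORKING_TRACE))  # drop
```

A test would then fail whenever the value for the bridge MAC is "drop".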
      

       

      Working Output

      • An ARP entry for the other peer nodes was added manually on each OCP node using the arp -s command.
      • Here the ARP request from bond0.282 is immediately forwarded to br-sec and an ARP response is sent.

       

      [root@slabnode2160 /]# tcpdump -B 20480 -s500 -i any arp -nnn | egrep "10.92.119.71|10.92.119.72"
      11:46:25.732466 bond0.282 In  ARP, Request who-has 10.92.119.72 tell 10.92.119.71, length 46
      11:46:25.732759 br-sec In  ARP, Request who-has 10.92.119.72 tell 10.92.119.71, length 46
      11:46:25.732766 br-sec Out ARP, Reply 10.92.119.72 is-at 00:21:5a:9b:56:34, length 28
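The difference between the two captures can be checked programmatically. The sketch below parses lines in the `tcpdump -i any` format shown in this ticket and reports which interfaces saw the request and whether a reply was sent. The function names and the regular expression are hypothetical helpers, written against these sample lines only.

```python
import re

ARP_LINE = re.compile(
    r"^\s*\S+\s+(?P<iface>\S+)\s+(?P<dir>\S+)\s+ARP,\s+"
    r"(?:Request who-has (?P<target>\S+) tell (?P<sender>[^,\s]+)"
    r"|Reply (?P<owner>\S+) is-at (?P<mac>[^,\s]+))"
)


def arp_summary(capture: str, target_ip: str) -> dict:
    """Collect interfaces that saw a request for target_ip and the MAC of any reply."""
    request_ifaces, reply_mac = [], None
    for line in capture.splitlines():
        m = ARP_LINE.match(line)
        if not m:
            continue
        if m.group("target") == target_ip:
            request_ifaces.append(m.group("iface"))
        elif m.group("owner") == target_ip:
            reply_mac = m.group("mac")
    return {"request_ifaces": request_ifaces, "reply_mac": reply_mac}


def local_port_answered(summary: dict, bridge: str) -> bool:
    """True if the request reached the bridge interface and a reply was captured."""
    return bridge in summary["request_ifaces"] and summary["reply_mac"] is not None


# Capture lines copied from the non-working and working cases in this ticket.
NON_WORKING = """\
11:26:48.161277 bond0.282 B   ARP, Request who-has 10.92.119.72 tell 10.92.119.71, length 46
11:26:48.161635 97f312f88fcb1_4 Out ARP, Request who-has 10.92.119.72 tell 10.92.119.71, length 46
11:26:48.161637 c94fef0483e5c_4 Out ARP, Request who-has 10.92.119.72 tell 10.92.119.71, length 46
11:26:48.161638 e8f1a02cb3c41_4 Out ARP, Request who-has 10.92.119.72 tell 10.92.119.71, length 46
"""
WORKING = """\
11:46:25.732466 bond0.282 In  ARP, Request who-has 10.92.119.72 tell 10.92.119.71, length 46
11:46:25.732759 br-sec In  ARP, Request who-has 10.92.119.72 tell 10.92.119.71, length 46
11:46:25.732766 br-sec Out ARP, Reply 10.92.119.72 is-at 00:21:5a:9b:56:34, length 28
"""

print(local_port_answered(arp_summary(NON_WORKING, "10.92.119.72"), "br-sec"))
print(local_port_answered(arp_summary(WORKING, "10.92.119.72"), "br-sec"))
```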

       

      In the working case, ofproto/trace has the correct OVS flow  for a VM's MAC running on the same system: 

      [root@slabnode2160 Ramesh-RH]# ovs-appctl ofproto/trace br-sec in_port=2,dl_dst=02:cb:df:00:00:02
      Flow: in_port=2,vlan_tci=0x0000,dl_src=00:00:00:00:00:00,dl_dst=02:cb:df:00:00:02,dl_type=0x0000
      bridge("br-sec")
      ----------------
       0. priority 0
          NORMAL
           -> forwarding to learned port
      bridge("br-int")
      ----------------
       0. in_port=213,vlan_tci=0x0000/0x1000, priority 100, cookie 0x4f2ead9a
          set_field:0xae/0xffff->reg13
          set_field:0xac->reg11
          set_field:0xab->reg12
          set_field:0x6->metadata
          set_field:0x1->reg14
          set_field:0/0xffff0000->reg13
          resubmit(,8)
       8. metadata=0x6, priority 50, cookie 0x965695cb
          set_field:0/0x1000->reg10
          resubmit(,73)
          73. reg0=0x2, priority 0
                  drop
          move:NXM_NX_REG10[12]->NXM_NX_XXREG0[111]
           -> NXM_NX_XXREG0[111] is now 0
          resubmit(,9)
       9. metadata=0x6, priority 0, cookie 0x22278151
      
      ....
          75. reg15=0x2,metadata=0x6,dl_dst=02:cb:df:00:00:02, priority 85, cookie 0x7bc811e5
                  set_field:0/0x1000->reg10
          move:NXM_NX_REG10[12]->NXM_NX_XXREG0[111]
           -> NXM_NX_XXREG0[111] is now 0
          resubmit(,55)
      55. metadata=0x6, priority 0, cookie 0x9133fe27
          resubmit(,64)
      64. priority 0
          resubmit(,65)
      65. reg15=0x2,metadata=0x6, priority 100, cookie 0x7bc811e5
          output:212
      Final flow: unchanged
      Megaflow: pkt_mark=0,recirc_id=0,ct_state=-new-est-rpl-trk,eth,in_port=2,dl_src=00:00:00:00:00:00,dl_dst=02:cb:df:00:00:02,dl_type=0x0000
      Datapath actions: 160
      

       

      Data To check on support shell for 04274950 ticket: 

      • 0160-ramesh-ovs-data.tar.gz - Includes OVS flows, sosreport of 10.92.119.72, TCPDUMP, and the OVS flow capture script.
      • 0080-mg-15-Oct.tar.gz
      • 0120-Ping_Working_&_Not_working_tcpdump
      • retis data - 0140-icmp_retis_events-from-71-to-72_23-Oct-25.json 

       

       Impact Assessment: Describe the severity and impact (e.g., network down, availability of a workaround, etc.).

      • Node-to-node communication fails over the VLAN interface that is attached to the OVS bridge (br-sec).

       Software Versions

      Platform       : BareMetal
      ClusterID      : f4e5ed83-603e-4a54-b4ce-bd5310ddd8bd
      ClusterVersion : 4.18.20
      ClientVersion  : 4.18.0-202507080904.p0.g4fcb2d0.assembly.stream-4fcb2d0
      Image          : quay-io-openshift-release-dev-ocp-v4-0-art-dev
      
      bash-5.1# ovs-vsctl --version
      ovs-vsctl (Open vSwitch) 3.5.1-19.el9fdp
      DB Schema 8.8.0
      
      bash-5.1# rpm -qa | grep openvswitch
      openvswitch-selinux-extra-policy-1.0-39.el9fdp.noarch
      openvswitch3.5-3.5.0-19.el9fdp.x86_64
      
      
      

       Reproducibility:

      • The issue may be reproducible by replicating the customer's network architecture.

              ovsdpdk-triage ovsdpdk triage
              nstbot NST Bot