Fast Datapath Product / FDP-764

BZ#2307181 When Migrating VM port goes down and does not come back up until 30 minutes later.

    • Bug
    • Resolution: Done-Errata
    • Critical
    • ovn23.09
    • rhel-sst-network-fastdatapath-ovn
    • ssg_networking

      Description of problem:
      When a VM is migrated, its port goes down and does not come back up until about 30 minutes later.

      Version-Release number of selected component (if applicable):
      RHOSP 17.1.1
      RHEL 9.4

      Containers
      OVN 17.1.1-4
      Nova 17.1.1-4.1698918413
      Neutron 17.1.1-4.1700583234

      How reproducible:
      Always

      Steps to Reproduce:
      1. Migrate a VM from one host to another; the port goes down.
      2. Wait for ~30 minutes.
      3. Port comes back up.

      Actual results:
      The port on the VM goes down and then comes back up about 30 minutes later.

      Expected results:
      Port is not expected to go down after migration.

      Additional info:
      It looks like unbinding the port from the original host is setting the port down.
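One way to quantify the outage described above is to poll the port state after the migration and record how long it stays down. The following is only a sketch. It assumes the "down" state is visible through the `up` status of the corresponding logical switch port, as reported by `ovn-nbctl lsp-get-up`. The function names, the `OVN_NBCTL` override variable, and the default polling values are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: measure how long a logical switch port stays "down".
# Assumption: the outage is reflected in `ovn-nbctl lsp-get-up <port>`,
# which prints "up" or "down". OVN_NBCTL is a hypothetical override hook.
OVN_NBCTL=${OVN_NBCTL:-ovn-nbctl}

# Print the current state ("up" or "down") of the given port.
lsp_state() {
    "$OVN_NBCTL" lsp-get-up "$1"
}

# Poll the port until it reports "up" or the attempts run out.
# Args: port name, poll interval in seconds (default 5),
#       maximum number of attempts (default 720, i.e. one hour at 5 s).
wait_for_up() {
    local port=$1 interval=${2:-5} max=${3:-720} n=0
    while [ "$n" -lt "$max" ]; do
        if [ "$(lsp_state "$port")" = "up" ]; then
            echo "$port up after $((n * interval))s"
            return 0
        fi
        sleep "$interval"
        n=$((n + 1))
    done
    echo "$port still down after $((n * interval))s"
    return 1
}
```

Running `wait_for_up <port-name>` right after starting the migration would print the approximate time at which the port came back up, at the granularity of the polling interval.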


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (ovn23.09 bug fix and enhancement update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHBA-2024:10896

            Ales Musil added a comment -

            Just for reference, the upstream fix for this issue is https://github.com/ovn-org/ovn/commit/16836c37


            Ales Musil added a comment -

            Yeah, I think this is good enough proof. The 25 ms vs 9 ms difference should be beyond the margin of error and the normal run-to-run fluctuations.


            Jianlin Shi added a comment -

            Tested with the following script:

            systemctl start openvswitch
            systemctl start ovn-northd 
            ovn-nbctl set-connection ptcp:6641                                       
            ovn-sbctl set-connection ptcp:6642
            ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.86.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.86.25
            systemctl restart ovn-controller                                                                                             
            ovn-nbctl ls-add ls1                                                                   
            ovn-nbctl lsp-add ls1 ls1p1                                                                     
            ovn-nbctl lsp-set-addresses ls1p1 "00:00:00:01:01:01 192.168.1.1 2001::1"
            ovn-nbctl lsp-set-options ls1p1 requested-chassis=hv1
            ovn-nbctl lsp-add ls1 ls1p2 
            ovn-nbctl lsp-set-addresses ls1p2 "00:00:00:01:01:02 192.168.1.2 2001::2"
            ovn-nbctl lsp-add ls1 lp                                       
            ovn-nbctl lsp-set-type lp localport     
            ovn-nbctl lsp-set-addresses lp "00:00:00:01:01:11 192.168.1.11 2001::11"
                                                                
            ovn-nbctl lr-add lr1                                                
            ovn-nbctl lrp-add lr1 lr1-ls1 00:00:00:00:00:01 192.168.1.254/24 2001::a/64
            ovn-nbctl lsp-add ls1 ls1-lr1
            ovn-nbctl lsp-set-addresses ls1-lr1 "00:00:00:00:00:01 192.168.1.254 2001::a"
            ovn-nbctl lsp-set-type ls1-lr1 router
            ovn-nbctl lsp-set-options ls1-lr1 router-port=lr1-ls1    
                                              
            ovn-nbctl lrp-add lr1 lr1-ls2 00:00:00:00:00:02 192.168.2.254/24 2002::a/64
                                                           
            ovn-nbctl ls-add ls2                                          
            ovn-nbctl lsp-add ls2 ls2-lr1                              
            ovn-nbctl lsp-set-addresses ls2-lr1 "00:00:00:00:00:02 192.168.2.254 2002::a"
            ovn-nbctl lsp-set-type ls2-lr1 router
            ovn-nbctl lsp-set-options ls2-lr1 router-port=lr1-ls2
                                                                           
            ovn-nbctl lsp-add ls2 ls2p1             
            ovn-nbctl lsp-set-addresses ls2p1 "00:00:00:01:02:01 192.168.2.1 2002::1"
            ovn-nbctl lsp-add ls2 ls2p2                         
            ovn-nbctl lsp-set-addresses ls2p2 "00:00:00:01:02:02 192.168.2.2 2002::2"

            for i in {11..60}
            do
                for j in {11..22}
                do
                    ovn-nbctl lsp-add ls1 t_${i}_$j
                    ovn-nbctl lsp-set-addresses t_${i}_$j "00:00:00:00:$i:$j"
                    ip link add t_${i}_$j type veth peer name t_${i}_${j}_p
                    ovs-vsctl add-port br-int t_${i}_$j -- set interface t_${i}_$j  external_ids:iface-id=t_${i}_$j
                done
            done
                                                                             
            ip link add ls1p1 type veth peer name ls1p1_p
            #ovs-vsctl add-port br-int ls1p1 -- set interface ls1p1 type=internal external_ids:iface-id=ls1p1
            ovs-vsctl add-port br-int ls1p1 -- set interface ls1p1  external_ids:iface-id=ls1p1
            ovs-vsctl add-port br-int lp -- set interface lp type=internal external_ids:iface-id=lp
            ovs-vsctl add-port br-int ls2p1 -- set interface ls2p1 type=internal external_ids:iface-id=ls2p1
                                                                                            
            ovn-nbctl lsp-set-options ls1p1 requested-chassis=hv1,hv0
            ovs-vsctl set interface ls1p1  mtu_request=1400
            ovn-nbctl --print-wait-time --wait=hv sync 

            Result on ovn23.09.4-33:

            + ovn-nbctl --print-wait-time --wait=hv sync                                             
            Time spent on processing nb_cfg 1:                                                       
                    ovn-northd delay before processing:     0ms                                      
                    ovn-northd completion:                  1ms                                      
                    ovn-controller(s) completion:           25ms 

            Result on ovn23.09.6-6:

            + ovn-nbctl --print-wait-time --wait=hv sync                                             
            Time spent on processing nb_cfg 1:                                                       
                    ovn-northd delay before processing:     1ms                                      
                    ovn-northd completion:                  2ms                                      
                    ovn-controller(s) completion:           9ms 

            Sync is faster on the fixed version. amusil@redhat.com, what do you think about the result? Does it prove that the problem is fixed?
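To compare the two builds without reading the output by eye, the `ovn-controller(s) completion` value can be pulled out of the `--print-wait-time` output. This is a minimal sketch that assumes the output format shown in the results above. The function name `controller_wait_ms` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: extract the "ovn-controller(s) completion" time (in ms) from the
# output of `ovn-nbctl --print-wait-time --wait=hv sync`, read on stdin.
# Assumes the line format shown in the test results in this issue.
controller_wait_ms() {
    awk -F: '/ovn-controller\(s\) completion/ {
        gsub(/[^0-9]/, "", $2)   # keep only the digits of the value field
        print $2
    }'
}

# Usage (requires a running OVN deployment):
#   ovn-nbctl --print-wait-time --wait=hv sync | controller_wait_ms
```

Since a single run can fluctuate, it may be worth repeating the trigger (the `mtu_request` change) and the sync a few times on each build before comparing the numbers.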


            Ales Musil added a comment -

            Hi rhn-support-jishi,

            The commit is "controller: Avoid quadratic complexity for multi-chassis ports.", which should be available in 23.09.4-34 and later. To reproduce, I set up a single LS with ~500 LSPs and corresponding OVS interfaces. I assigned two chassis to one LSP, e.g.:

            ovn-nbctl lsp-set-options lsp-multi requested-chassis=hv1,hv2

            To trigger the problematic handler, you need to update the MTU of the OVS interface that corresponds to that multi-chassis LSP, e.g.:

            ovs-vsctl set interface lsp-multi mtu_request=1400

            There should be a noticeable difference in the wait time of the sync command with and without the patch:

            ovn-nbctl --print-wait-time --wait=hv sync 
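The reproduction steps above could be scripted roughly as follows. This is a sketch, not the setup that was actually used: the switch name `ls-repro`, the port names `p<i>`, and the choice of `type=internal` interfaces are assumptions made for illustration, and the chassis names `hv1,hv2` are taken from the example above. The function only prints the commands, so they can be reviewed before being piped to a shell on a test host.

```shell
#!/usr/bin/env bash
# Sketch: print (do not run) the commands for a single logical switch with
# N ordinary ports plus one multi-chassis port, followed by the MTU change
# that triggers the handler and the timed sync.
# Hypothetical names: ls-repro, p<i>. type=internal is an assumption.
gen_repro_cmds() {
    local n=${1:-500} i
    echo "ovn-nbctl ls-add ls-repro"
    for i in $(seq 1 "$n"); do
        echo "ovn-nbctl lsp-add ls-repro p$i"
        echo "ovs-vsctl add-port br-int p$i -- set interface p$i type=internal external_ids:iface-id=p$i"
    done
    # The multi-chassis port and the trigger.
    echo "ovn-nbctl lsp-add ls-repro lsp-multi"
    echo "ovs-vsctl add-port br-int lsp-multi -- set interface lsp-multi type=internal external_ids:iface-id=lsp-multi"
    echo "ovn-nbctl lsp-set-options lsp-multi requested-chassis=hv1,hv2"
    echo "ovs-vsctl set interface lsp-multi mtu_request=1400"
    echo "ovn-nbctl --print-wait-time --wait=hv sync"
}

# Usage on a disposable test host with OVN already running:
#   gen_repro_cmds 500 | bash -x
```

Generating the commands instead of running them directly keeps the sketch safe to try on a machine without OVN and makes it easy to change the number of ports.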


            Jianlin Shi added a comment -

            amusil@redhat.com, which patch fixed the issue? Any suggestions on how to reproduce it?


            Rodolfo Alonso added a comment -

            Hello:

            I've migrated this bug to FDP project. Ales Musil is the owner of this bug.

            Regards.

            Rodolfo Alonso added a comment -

            Hello:

            I've changed the owner of the BZ in order to make the communication direct (and avoid being a proxy).

            Regards.

            Rodolfo Alonso added a comment -

            Hello rhn-support-ssigwald:

            I've been told to ping you as you own the case. Same question as before: Ales Musil has a potential fix that can be backported to 22.12 (see prio channel thread [1]). Is it OK to provide this build to the customer for testing?

            Regards.

              amusil@redhat.com Ales Musil
              jira-bugzilla-migration RH Bugzilla Integration
              Jianlin Shi Jianlin Shi