Fast Datapath Product / FDP-441

CR-LRP port flip-flops after BFD failover due to unexpected chassis failure


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • ovn22.12

      Given a network with three controller chassis, where controller-2 (highest priority) crashes unexpectedly and controller-1 takes over as the BFD leader,

      When controller-2 restarts and attempts to reclaim BFD leadership and the associated lport from controller-1 after approximately 2 minutes offline,

      Then, controller-1 should maintain stable lport ownership during controller-2's downtime, ensuring no network interruptions, and
      upon controller-2's recovery and reclamation of leadership, the lport should transition back to controller-2 without the port "flip-flopping".

    • FDP 24.F

      This bug was originally reported via customer escalation [1], where the customer observed >20 seconds of network downtime when the original leader chassis (from the BFD perspective) comes back online after an unexpected crash (e.g. power outage, kernel crash, etc.).

      This can be reproduced on the latest OSP 17.1 release (Open vSwitch Library 3.0.90, OVN 22.12.3).

      Reproduced internally:

      Here are the BFD controller chassis priorities:

       

      lrp-83e5d52a-a7ff-48c3-baaf-76e265769334_cf9c287c-7622-4dbb-8509-1e3698ff3240     3 #cont-2
      lrp-83e5d52a-a7ff-48c3-baaf-76e265769334_20248462-28d4-4bd3-ba00-12235c54ea79     2 #cont-1
      lrp-83e5d52a-a7ff-48c3-baaf-76e265769334_9b2f10a3-dcbd-4ec2-ae01-fb3b310f99c2     1 #cont-0 
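
      These priorities look like the Gateway_Chassis entries configured on the lrp; if so, they can be listed from the northbound DB. A hedged example (in some deployments the priorities could instead live in an HA_Chassis_Group):

      # list the gateway chassis of the lrp, sorted by priority
      ovn-nbctl lrp-get-gateway-chassis lrp-83e5d52a-a7ff-48c3-baaf-76e265769334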

       

      So controller-2, where lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 resides, is the BFD leader. For completeness, controller-2 is also the Raft leader.
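
      (A hedged way to confirm the Raft leader on a standard clustered SB DB; the ctl socket path may differ per deployment:)

      # prints the cluster role (leader/follower) of the local SB server
      ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound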

      I purposely crashed controller-2 on 2/28 at 18:52:09. Note, it is important that the leader is crashed in this unexpected way; if you properly stop the ovs/ovn systemctl services, the issue does not reproduce.

      [root@controller-2 tripleo-admin]# date +"%T.%N" && echo c > /proc/sysrq-trigger 
      2/28/2024 18:52:09.394099804 
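
      While reproducing, the current owner of the cr-lrp port can be watched from the southbound DB; a hedged example (the chassis column should switch between the controllers during the failover):

      # show which chassis currently claims the port binding
      ovn-sbctl --columns logical_port,chassis find Port_Binding logical_port=cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334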

      Controller-1 takes over and claims the cr-lrp port, as shown in its ovn-controller log. However, when controller-2 comes back online at ~18:54:50, the chassis that is currently the leader starts flip-flopping between releasing and claiming that port. Here is the ovn-controller log of controller-1:

      2024-02-28T18:52:11.452Z|00273|binding|INFO|Changing chassis for lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from cf9c287c-7622-4dbb-8509-1e3698ff3240 to 20248462-28d4-4bd3-ba00-12235c54ea79.
      2024-02-28T18:52:11.452Z|00274|binding|INFO|cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334: Claiming fa:16:3e:4b:84:60 10.0.0.191/24 2620:52:0:13b8::1000:6/64
      2024-02-28T18:54:56.079Z|00275|binding|INFO|Releasing lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from this chassis (sb_readonly=0)
      2024-02-28T18:54:56.079Z|00276|if_status|WARN|Trying to release unknown interface cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334
      2024-02-28T18:55:00.068Z|00277|binding|INFO|Claiming lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 for this chassis.
      2024-02-28T18:55:00.068Z|00278|binding|INFO|cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334: Claiming fa:16:3e:4b:84:60 10.0.0.191/24 2620:52:0:13b8::1000:6/64
      2024-02-28T18:55:01.873Z|00279|binding|INFO|Changing chassis for lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from cf9c287c-7622-4dbb-8509-1e3698ff3240 to 20248462-28d4-4bd3-ba00-12235c54ea79.
      2024-02-28T18:55:01.873Z|00280|binding|INFO|cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334: Claiming fa:16:3e:4b:84:60 10.0.0.191/24 2620:52:0:13b8::1000:6/64
      2024-02-28T18:55:02.373Z|00281|binding|INFO|Changing chassis for lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from cf9c287c-7622-4dbb-8509-1e3698ff3240 to 20248462-28d4-4bd3-ba00-12235c54ea79.
      2024-02-28T18:55:02.373Z|00282|binding|INFO|cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334: Claiming fa:16:3e:4b:84:60 10.0.0.191/24 2620:52:0:13b8::1000:6/64
      2024-02-28T18:55:02.588Z|00283|binding|INFO|Releasing lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from this chassis (sb_readonly=0) 
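
      The BFD session state as seen by controller-1 can also be dumped while this is happening; a hedged example (ovn-cf9c28-0 is assumed to be controller-1's tunnel interface toward controller-2, following the ovn-xxxxxx-0 interface naming visible in the log below):

      # dump BFD state; optionally limit to the tunnel interface toward controller-2
      ovs-appctl bfd/show ovn-cf9c28-0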

      I think this is because the original leader does not signal BFD enabled until a few seconds later. From controller-2's ovn-controller.log:

      2024-02-28T18:55:01.554Z|00022|ovn_bfd|INFO|Enabled BFD on interface ovn-111b70-0
      2024-02-28T18:55:01.554Z|00023|ovn_bfd|INFO|Enabled BFD on interface ovn-202484-0
      2024-02-28T18:55:01.554Z|00024|ovn_bfd|INFO|Enabled BFD on interface ovn-9b2f10-0
      2024-02-28T18:55:01.559Z|00025|main|INFO|OVS feature set changed, force recompute.
      2024-02-28T18:55:01.776Z|00026|binding|INFO|Changing chassis for lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from 20248462-28d4-4bd3-ba00-12235c54ea79 to cf9c287c-7622-4dbb-8509-1e3698ff3240. 

      This flip-flopping of the cr-lrp port causes network downtime for the customer. They reported ~20 seconds of downtime, and their environment is much larger, with many more cr-lrp ports than this simple example with a single cr-lrp. I experienced ~20 ICMP packets lost during this "flip flop" when pinging the OVN router port IP, and 17 ICMP packets lost when pinging a VM routed via that OVN router.
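
      One way to observe that downtime window is a timestamped ping against the router port IP from the example above; a hedged sketch (default 1-second interval, so ~20 lost replies roughly matches the reported ~20 seconds):

      ping -D 10.0.0.191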

      I mentioned this scenario on the OVN Slack channel [2] and am creating this bug with the requested info.

      Last thing: this patch [3] addresses a problem that seems very similar, with one difference: here the lrp is eventually claimed, in the flip-flop fashion shown above. However, that patch might help in this scenario (I think).

       

      Requested logs (you might notice a bunch of claiming and releasing of the cr-lrp port BEFORE 2/28/24 18:52; those were done with a proper shutdown of the ovs/ovn services, just for testing purposes): lrp_flip_flop_failover.tar.xz

      [1] https://bugzilla.redhat.com/show_bug.cgi?id=2262654

      [2] https://redhat-internal.slack.com/archives/C01G7T6SYSD/p1709150922487429

      [3] https://mail.openvswitch.org/pipermail/ovs-discuss/2023-September/052688.html

            Xavier Simonart <xsimonar@redhat.com>
            Miro Tomaska <mtomaska@redhat.com>
            Jianlin Shi