This bug was originally reported via customer escalation [1], where the customer observed >20 seconds of network downtime when the original leader chassis (from the BFD perspective) comes back online after an unexpected crash (e.g. power outage, kernel crash, etc.).
This can be reproduced on the latest OSP 17.1 release (Open vSwitch library 3.0.90, OVN 22.12.3).
Reproduced internally:
Here are the BFD controller chassis priorities:
lrp-83e5d52a-a7ff-48c3-baaf-76e265769334_cf9c287c-7622-4dbb-8509-1e3698ff3240 3 #cont-2
lrp-83e5d52a-a7ff-48c3-baaf-76e265769334_20248462-28d4-4bd3-ba00-12235c54ea79 2 #cont-1
lrp-83e5d52a-a7ff-48c3-baaf-76e265769334_9b2f10a3-dcbd-4ec2-ae01-fb3b310f99c2 1 #cont-0
So controller-2 is the BFD leader, and it is the chassis where lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 resides. For completeness, controller-2 is also the RAFT leader.
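For reference, a listing like the one above can be pulled from the Southbound database; this is just a sketch, assuming the deployment uses the Gateway_Chassis table (the names follow the <lrp>_<chassis> pattern seen above) and the standard ovn-sbctl db commands:

# Sketch: list the gateway chassis entries and their priorities from the SB DB
ovn-sbctl --columns=name,priority list Gateway_Chassis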
I purposely crashed controller-2 on 2/28 at 18:52:09. Note that it is important the leader is crashed in this unexpected way; if you properly stop the OVS/OVN systemd services, the issue does not reproduce.
[root@controller-2 tripleo-admin]# date +"%T.%N" && echo c > /proc/sysrq-trigger
2/28/2024 18:52:09.394099804
Controller-1 takes over and claims the cr-lrp port, as shown in its ovn-controller log. However, when controller-2 comes back online at ~18:54:50, the chassis that is currently the leader starts flip-flopping between releasing and claiming that port. Here is the ovn-controller log from controller-1:
2024-02-28T18:52:11.452Z|00273|binding|INFO|Changing chassis for lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from cf9c287c-7622-4dbb-8509-1e3698ff3240 to 20248462-28d4-4bd3-ba00-12235c54ea79.
2024-02-28T18:52:11.452Z|00274|binding|INFO|cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334: Claiming fa:16:3e:4b:84:60 10.0.0.191/24 2620:52:0:13b8::1000:6/64
2024-02-28T18:54:56.079Z|00275|binding|INFO|Releasing lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from this chassis (sb_readonly=0)
2024-02-28T18:54:56.079Z|00276|if_status|WARN|Trying to release unknown interface cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334
2024-02-28T18:55:00.068Z|00277|binding|INFO|Claiming lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 for this chassis.
2024-02-28T18:55:00.068Z|00278|binding|INFO|cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334: Claiming fa:16:3e:4b:84:60 10.0.0.191/24 2620:52:0:13b8::1000:6/64
2024-02-28T18:55:01.873Z|00279|binding|INFO|Changing chassis for lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from cf9c287c-7622-4dbb-8509-1e3698ff3240 to 20248462-28d4-4bd3-ba00-12235c54ea79.
2024-02-28T18:55:01.873Z|00280|binding|INFO|cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334: Claiming fa:16:3e:4b:84:60 10.0.0.191/24 2620:52:0:13b8::1000:6/64
2024-02-28T18:55:02.373Z|00281|binding|INFO|Changing chassis for lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from cf9c287c-7622-4dbb-8509-1e3698ff3240 to 20248462-28d4-4bd3-ba00-12235c54ea79.
2024-02-28T18:55:02.373Z|00282|binding|INFO|cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334: Claiming fa:16:3e:4b:84:60 10.0.0.191/24 2620:52:0:13b8::1000:6/64
2024-02-28T18:55:02.588Z|00283|binding|INFO|Releasing lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from this chassis (sb_readonly=0)
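The same flip-flop can also be watched from the database side; the loop below is a hypothetical helper added here only for illustration (it assumes ovn-sbctl on the node can reach the Southbound DB), not the exact commands used in this repro:

# Hypothetical watcher: print which chassis currently owns the cr-lrp Port_Binding
while true; do
    date +"%T.%N"
    ovn-sbctl --columns=chassis find Port_Binding \
        logical_port=cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334
    sleep 1
done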
I think this is because the original leader does not signal BFD as enabled until a few seconds after it comes back online (a way to check this from the OVS side is sketched after the log below). From controller-2's ovn-controller.log:
2024-02-28T18:55:01.554Z|00022|ovn_bfd|INFO|Enabled BFD on interface ovn-111b70-0
2024-02-28T18:55:01.554Z|00023|ovn_bfd|INFO|Enabled BFD on interface ovn-202484-0
2024-02-28T18:55:01.554Z|00024|ovn_bfd|INFO|Enabled BFD on interface ovn-9b2f10-0
2024-02-28T18:55:01.559Z|00025|main|INFO|OVS feature set changed, force recompute.
2024-02-28T18:55:01.776Z|00026|binding|INFO|Changing chassis for lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from 20248462-28d4-4bd3-ba00-12235c54ea79 to cf9c287c-7622-4dbb-8509-1e3698ff3240.
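The BFD state can also be inspected directly on the chassis through the OVS Interface table; a quick check along these lines (a sketch, assuming the standard bfd/bfd_status Interface columns and ovs-vsctl):

# Print the BFD session state of every interface with BFD enabled
for iface in $(ovs-vsctl --bare --columns=name find Interface bfd:enable=true); do
    echo -n "$iface: "
    ovs-vsctl get Interface "$iface" bfd_status:state
done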
This flip-flopping of the cr-lrp port causes network downtime for the customer. They reported a 20-second downtime, and their environment is much larger, with many more cr-lrp ports than this simple example with a single cr-lrp. During the flip-flop I observed a loss of ~20 ICMP packets to the OVN router port IP and 17 ICMP packets to a VM routed via that OVN router.
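For reference, the loss above was measured with plain pings, roughly as follows (the default 1s interval is an assumption, and <vm-ip> is a placeholder for the VM routed via this OVN router):

# Ping the router port IP (from this repro) and a VM behind the router
ping -D 10.0.0.191
ping -D <vm-ip>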
I mentioned this scenario on the OVN Slack channel [2], and I am creating this bug with the requested info.
One last thing: this patch [3] addresses what seems to be a very similar problem, with one difference: here the lrp is eventually claimed, albeit in the flip-flop fashion shown above. However, that patch could help in this scenario as well (I think).
Requested logs (you might notice a bunch of claiming and releasing of the cr-lrp port BEFORE 2/28/24 18:52; those were done with a proper shutdown of the OVS/OVN services, just for testing purposes): lrp_flip_flop_failover.tar.xz
[1] https://bugzilla.redhat.com/show_bug.cgi?id=2262654
[2] https://redhat-internal.slack.com/archives/C01G7T6SYSD/p1709150922487429
[3] https://mail.openvswitch.org/pipermail/ovs-discuss/2023-September/052688.html