Fast Datapath Product / FDP-441

CR-LRP port flip-flops after BFD failover due to unexpected chassis failure


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • ovn22.12

      Given a network with three controller chassis, where controller-2 (highest priority) crashes unexpectedly and controller-1 takes over as the BFD leader,

      When controller-2 restarts and attempts to reclaim BFD leadership and the associated lport from controller-1 after approximately 2 minutes offline,

      Then, controller-1 should maintain stable lport ownership during controller-2's downtime, ensuring no network interruptions, and
      upon controller-2's recovery and reclamation of leadership, the lport should transition back to controller-2 without the port "flip-flopping".

    • FDP 24.F

      This bug was originally reported via customer escalation [1], where the customer observed >20 seconds of network downtime when the original leader chassis (from the BFD perspective) comes back online after an unexpected crash (e.g. power outage, kernel crash, etc.).

      This can be reproduced on the latest OSP 17.1 release (Open vSwitch Library 3.0.90, OVN 22.12.3).

      Reproduced internally:

      Here are the BFD controller chassis priorities:

       

      lrp-83e5d52a-a7ff-48c3-baaf-76e265769334_cf9c287c-7622-4dbb-8509-1e3698ff3240     3 #cont-2
      lrp-83e5d52a-a7ff-48c3-baaf-76e265769334_20248462-28d4-4bd3-ba00-12235c54ea79     2 #cont-1
      lrp-83e5d52a-a7ff-48c3-baaf-76e265769334_9b2f10a3-dcbd-4ec2-ae01-fb3b310f99c2     1 #cont-0 
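
      These priorities look like the Gateway_Chassis entries configured on the lrp; if so, they can be listed from the northbound DB. A hedged example (in some deployments the priorities could instead live in an HA_Chassis_Group):

      # list the gateway chassis of the lrp, sorted by priority
      ovn-nbctl lrp-get-gateway-chassis lrp-83e5d52a-a7ff-48c3-baaf-76e265769334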

       

      So controller-2, where lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 resides, is the BFD leader. For completeness, controller-2 is also the Raft leader.
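
      (A hedged way to confirm the Raft leader on a standard clustered SB DB; the ctl socket path may differ per deployment:)

      # prints the cluster role (leader/follower) of the local SB server
      ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound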

      I purposely crashed controller-2 on 2/28 at 18:52:09. Note, it is important that the leader is crashed in this unexpected way; if you properly stop the ovs/ovn systemctl services, the issue does not reproduce.

      [root@controller-2 tripleo-admin]# date +"%T.%N" && echo c > /proc/sysrq-trigger 
      2/28/2024 18:52:09.394099804 
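
      While reproducing, the current owner of the cr-lrp port can be watched from the southbound DB; a hedged example (the chassis column should switch between the controllers during the failover):

      # show which chassis currently claims the port binding
      ovn-sbctl --columns logical_port,chassis find Port_Binding logical_port=cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334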

      Controller-1 takes over and claims the cr-lrp port, as shown in its ovn-controller log. However, when controller-2 comes back online at ~18:54:50, the chassis that is currently the leader starts flip-flopping between releasing and claiming that port. Here is the ovn-controller log of controller-1:

      2024-02-28T18:52:11.452Z|00273|binding|INFO|Changing chassis for lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from cf9c287c-7622-4dbb-8509-1e3698ff3240 to 20248462-28d4-4bd3-ba00-12235c54ea79.
      2024-02-28T18:52:11.452Z|00274|binding|INFO|cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334: Claiming fa:16:3e:4b:84:60 10.0.0.191/24 2620:52:0:13b8::1000:6/64
      2024-02-28T18:54:56.079Z|00275|binding|INFO|Releasing lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from this chassis (sb_readonly=0)
      2024-02-28T18:54:56.079Z|00276|if_status|WARN|Trying to release unknown interface cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334
      2024-02-28T18:55:00.068Z|00277|binding|INFO|Claiming lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 for this chassis.
      2024-02-28T18:55:00.068Z|00278|binding|INFO|cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334: Claiming fa:16:3e:4b:84:60 10.0.0.191/24 2620:52:0:13b8::1000:6/64
      2024-02-28T18:55:01.873Z|00279|binding|INFO|Changing chassis for lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from cf9c287c-7622-4dbb-8509-1e3698ff3240 to 20248462-28d4-4bd3-ba00-12235c54ea79.
      2024-02-28T18:55:01.873Z|00280|binding|INFO|cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334: Claiming fa:16:3e:4b:84:60 10.0.0.191/24 2620:52:0:13b8::1000:6/64
      2024-02-28T18:55:02.373Z|00281|binding|INFO|Changing chassis for lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from cf9c287c-7622-4dbb-8509-1e3698ff3240 to 20248462-28d4-4bd3-ba00-12235c54ea79.
      2024-02-28T18:55:02.373Z|00282|binding|INFO|cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334: Claiming fa:16:3e:4b:84:60 10.0.0.191/24 2620:52:0:13b8::1000:6/64
      2024-02-28T18:55:02.588Z|00283|binding|INFO|Releasing lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from this chassis (sb_readonly=0) 
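
      The BFD session state as seen by controller-1 can also be dumped while this is happening; a hedged example (ovn-cf9c28-0 is assumed to be controller-1's tunnel interface toward controller-2, following the ovn-xxxxxx-0 interface naming visible in the log below):

      # dump BFD state; optionally limit to the tunnel interface toward controller-2
      ovs-appctl bfd/show ovn-cf9c28-0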

      I think this is because the original leader does not signal BFD enabled until a few seconds later. From controller-2's ovn-controller.log:

      2024-02-28T18:55:01.554Z|00022|ovn_bfd|INFO|Enabled BFD on interface ovn-111b70-0
      2024-02-28T18:55:01.554Z|00023|ovn_bfd|INFO|Enabled BFD on interface ovn-202484-0
      2024-02-28T18:55:01.554Z|00024|ovn_bfd|INFO|Enabled BFD on interface ovn-9b2f10-0
      2024-02-28T18:55:01.559Z|00025|main|INFO|OVS feature set changed, force recompute.
      2024-02-28T18:55:01.776Z|00026|binding|INFO|Changing chassis for lport cr-lrp-83e5d52a-a7ff-48c3-baaf-76e265769334 from 20248462-28d4-4bd3-ba00-12235c54ea79 to cf9c287c-7622-4dbb-8509-1e3698ff3240. 

      This flip-flopping of the cr-lrp port causes network downtime for the customer. They reported ~20 seconds of downtime, and their environment is much larger, with many more cr-lrp ports than this simple example with a single cr-lrp. I experienced ~20 ICMP packets lost during this "flip flop" when pinging the OVN router port IP, and 17 ICMP packets lost when pinging a VM routed via that OVN router.
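
      One way to observe that downtime window is a timestamped ping against the router port IP from the example above; a hedged sketch (default 1-second interval, so ~20 lost replies roughly matches the reported ~20 seconds):

      ping -D 10.0.0.191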

      I mentioned this scenario on the OVN Slack channel [2] and am creating this bug with the requested info.

      Last thing: this patch [3] addresses a problem that seems very similar, with one difference: here the lrp is eventually claimed, in the flip-flop fashion shown above. However, that patch might help in this scenario (I think).

       

      Requested logs (you might notice a bunch of claiming and releasing of the cr-lrp port BEFORE 2/28/24 18:52; those were done with a proper shutdown of the ovs/ovn services, just for testing purposes): lrp_flip_flop_failover.tar.xz

      [1] https://bugzilla.redhat.com/show_bug.cgi?id=2262654

      [2] https://redhat-internal.slack.com/archives/C01G7T6SYSD/p1709150922487429

      [3] https://mail.openvswitch.org/pipermail/ovs-discuss/2023-September/052688.html

            Xavier Simonart <xsimonar@redhat.com>
            Miro Tomaska <mtomaska@redhat.com>
            Jianlin Shi