-
Bug
-
Resolution: Can't Do
-
Normal
-
None
-
None
-
None
-
5
-
False
-
-
False
-
rhel-9
-
None
-
rhel-net-ovs-dpdk
-
-
-
ssg_networking
-
OVS/DPDK - FDP-25.E - 1
-
1
Problem Description:
Cloned from https://issues.redhat.com/browse/OSPRH-16179:
Customer reports two failover events, 2025-05-05T04:42 (supportshell logs 0090-0140) and 2025-05-06T02:46 (supportshell logs 0150-0200). In both cases, ovs-vswitchd logs show BFD errors for ~2 seconds on lpctrl-5002 and lpctrl-5003, at which point the ovn-controller processes start claiming router ports, as they should. Failover happens quickly, but the customer assumes that advertisements from multiple locations are confusing their switches. lpctrl-5001 did not have an issue in either case, but that may simply be due to the small sample size.
The primary question is "Why are the routers failing over?" and the answer appears to be "chassis did not respond to BFD pings quickly enough, so OVN did what it was supposed to do: move the ports". vswitchd does not appear to be under excessive load around the time of the failovers. ovn-controller.log contains many "Transaction causes multiple rows in \"MAC_Binding\" table to have identical values" errors, which the customer mentions. These do cause recomputes, but they do not seem to be the cause of the issue: the recomputes take around 0.5s and do not appear to happen adjacent to the BFD timeouts.
In short, Neutron/OVN seem to be working as intended, reacting to a brief period of unreachability. Raising the BFD timeouts may be necessary to match the customer's hardware and load. Looking for input from the FDP team to verify and suggest mitigation strategies.
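To make the "bump the BFD timeouts" option concrete, here is a rough sketch of how the BFD detection time scales with the timer settings. It uses the asynchronous-mode rule from RFC 5880 (detection time = remote detect multiplier x max(local required min RX, remote desired min TX)). The function name and all timer values below are illustrative assumptions, not values read from the customer's environment. The effective settings on the affected chassis would need to be checked before drawing conclusions.

```python
def bfd_detection_time_ms(local_min_rx_ms, remote_min_tx_ms, remote_detect_mult):
    """Return the RFC 5880 asynchronous-mode detection time in milliseconds.

    A session is declared down when no control packet arrives within
    remote_detect_mult * max(local_min_rx_ms, remote_min_tx_ms).
    """
    if min(local_min_rx_ms, remote_min_tx_ms, remote_detect_mult) <= 0:
        raise ValueError("BFD timers and multiplier must be positive")
    return remote_detect_mult * max(local_min_rx_ms, remote_min_tx_ms)


# Hypothetical candidate settings (min_rx ms, min_tx ms, multiplier).
# These are NOT taken from the customer's configuration.
candidates = [
    (1000, 100, 3),
    (2000, 100, 3),
    (2000, 100, 5),
]

for min_rx, min_tx, mult in candidates:
    total = bfd_detection_time_ms(min_rx, min_tx, mult)
    print(f"min_rx={min_rx}ms min_tx={min_tx}ms mult={mult} -> detection {total}ms")
```

If the stalls seen in the logs last about 2 seconds, any setting whose detection time comfortably exceeds that window should ride through them, at the cost of slower failover on a genuine outage. In OVN these timers are normally exposed as NB_Global options (bfd-min-rx, bfd-min-tx, bfd-mult), but how they are set and persisted in this RHOSP deployment is an assumption to confirm with the FDP team.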
Impact Assessment: Network downtime when this happens
Software Versions: openvswitch3.1-3.1.0-104.el9fdp.x86_64 (RHOSP 17.1.3)
Reproducibility: Seems to happen multiple times per week, sometimes multiple times per day
Logs: these are on supportshell
- duplicates
-
OSPRH-16179 The ovn cluster falls apart frequently and then shuffles around the routers.
-
- Closed
-
- is duplicated by
-
OSPRH-16179 The ovn cluster falls apart frequently and then shuffles around the routers.
-
- Closed
-