Bug
Resolution: Unresolved
Major
4.14.z, 4.15.z, 4.17.z, 4.16.z, 4.18.0
Description of problem:
Bonded network configurations with mode=active-backup and fail_over_mac=follow do not function because of a race condition in /var/usrlocal/bin/configure-ovs.sh.
The race causes the bond to flap.
The customer who encountered the issue in July worked with the IBM LTC Power team to trace it from the Linux kernel through OVN-Kube and into the MCO configuration. Customer details can be shared in Slack.
The corresponding BZ https://bugzilla.linux.ibm.com/show_bug.cgi?id=210291 could not be mirrored.
The GitHub issue is https://github.com/openshift/machine-config-operator/issues/4605
The fix is in https://github.com/openshift/machine-config-operator/pull/4609
From Dave Wilder... the interfaces are set up as described in the issue...
At this point the MACs of the bond's slaves (enP32807p1s0, enP49154p1s0) are the same. The purpose of fail_over_mac=follow is to ensure the MACs will not be the same, so this prevents the bond from functioning. It initially appeared to be a problem with the bonding driver. After tracing the calls NetworkManager makes to the bonding driver, I discovered that the root of the problem is in configure-ovs.sh.
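To confirm the symptom on a node, the two slaves' MACs can be compared directly. The following is a minimal sketch, not taken from the ticket: the helper name slaves_share_mac and the SYSFS_NET override are hypothetical, and it assumes the standard Linux sysfs layout in which each interface exposes its MAC in /sys/class/net/<iface>/address.

```shell
#!/bin/bash
# Hypothetical diagnostic helper (not part of configure-ovs.sh).
# Returns 0 if the two interfaces report the same MAC address,
# 1 if the MACs differ, and 2 if either address cannot be read.
# SYSFS_NET is an assumed override so the function can be tried
# against a scratch directory; it defaults to /sys/class/net.
slaves_share_mac() {
  local sysfs="${SYSFS_NET:-/sys/class/net}"
  local mac_a mac_b
  mac_a=$(cat "$sysfs/$1/address") || return 2
  mac_b=$(cat "$sysfs/$2/address") || return 2
  [ "$mac_a" = "$mac_b" ]
}

# Example call, using the interface names from this report:
#   if slaves_share_mac enP32807p1s0 enP49154p1s0; then
#     echo "both slaves share a MAC"
#   fi
```

With fail_over_mac=follow the expectation stated above is that the function returns 1 (MACs differ) for the two slaves of a healthy bond.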
The function activate_nm_connections() attempts to activate every generated profile that is not currently in the "active" state. In my case the following profiles are activated one at a time, in this order:
br-ex, ovs-if-phys0, enP32807p1s0-slave-ovs-clone, enP49154p1s0-slave-ovs-clone, ovs-if-br-ex
However, the generated profiles have autoconnect-slaves set. When br-ex is activated, the state of ovs-if-phys0, enP32807p1s0-slave-ovs-clone and enP49154p1s0-slave-ovs-clone therefore changes to "activating". Because we only check for the "activated" state, these profiles may be activated again. As the list is walked, the state of some profiles automatically goes from activating to active, and those interfaces are not activated a second time. Which profiles get the second activation depends on timing, which leaves the bond in an unpredictable state. I am able to see in the bonding traces why both slave interfaces have the same MAC.
My fix is to check for either activating or active states.
--- configure-ovs.sh 2024-09-20 15:29:03.160536239 -0700
+++ configure-ovs.sh.patched 2024-09-20 15:33:38.040336032 -0700
@@ -575,8 +575,8 @@
 # But set the entry in master_interfaces to true if this is a slave
 # Also set autoconnect to yes
 local active_state=$(nmcli -g GENERAL.STATE conn show "$conn")
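The check described above can be sketched as follows. This is an illustration only and not the contents of the patch in PR 4609: the helper names needs_activation and activate_pending are hypothetical, and the only nmcli calls assumed are the GENERAL.STATE query shown in the diff context and nmcli conn up.

```shell
#!/bin/bash
# Hypothetical helper (not part of configure-ovs.sh): given a profile's
# GENERAL.STATE value, return 0 if the profile still needs to be brought
# up, or 1 if it is already up or already on its way up.
needs_activation() {
  local state="$1"
  case "$state" in
    # "activating" covers profiles that NetworkManager is already
    # bringing up, e.g. slaves pulled in by autoconnect-slaves.
    activated|activating) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical wrapper: walk the given profiles and activate only those
# that are neither activated nor activating.
activate_pending() {
  local conn state
  for conn in "$@"; do
    state=$(nmcli -g GENERAL.STATE conn show "$conn")
    if needs_activation "$state"; then
      nmcli conn up "$conn"
    fi
  done
}

# Example call, using the profile names from this report:
#   activate_pending br-ex ovs-if-phys0 enP32807p1s0-slave-ovs-clone \
#     enP49154p1s0-slave-ovs-clone ovs-if-br-ex
```

Treating "activating" the same as "activated" avoids issuing a second activation for a profile that autoconnect-slaves has already started, which is the race described in this report.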
Version-Release number of selected component (if applicable): First seen in 4.14 OVN-Kube
How reproducible: Reproducible with a specific OVN-Kube configuration in which the bond is set to fail_over_mac=follow. This is the ideal setting for the SR-IOV/Network setup at the customer site, where they rely on high availability.
Steps to Reproduce:
1. Set up the interfaces as described.
Actual results: The bond fails: both slaves end up with the same MAC and the bond flaps.
Expected results: No flapping, and failover works.
Additional info:
https://github.com/openshift/machine-config-operator/issues/4605
https://github.com/openshift/machine-config-operator/pull/4609
#rhel-netorking-subsystem https://redhat-internal.slack.com/archives/C04NN96F1S4/p1719943109040989