[OCPBUGS-44185] Using nmcli to activate (up) or deactivate (down) the active slaves breaks the bond. - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: None
Affects Version/s: 4.14.z, 4.15.z, 4.17.z, 4.16.z, 4.18.0
Component/s: Machine Config Operator
Labels:
None

Regression:
None
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

None
Target Version:

4.18.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:
Bonded network configurations with mode=active-backup and fail_over_mac=follow are not functioning due to a race in /var/usrlocal/bin/configure-ovs.sh

This race condition results in flapping.

The customer who encountered the issue in July worked with the IBM LTC Power team to track it through the Linux kernel to OVN-Kube and into the MCO configuration. The customer details can be shared in Slack.

The corresponding BZ https://bugzilla.linux.ibm.com/show_bug.cgi?id=210291 could not be mirrored.

The GH issue is in https://github.com/openshift/machine-config-operator/issues/4605
The fix is in https://github.com/openshift/machine-config-operator/pull/4609

From Dave Wilder... the interfaces are set up as described in the issue...

At this point the MACs of the bond's slaves (enP32807p1s0, enP49154p1s0) are the same. The purpose of fail_over_mac=follow is to ensure the MACs will not be the same, so this prevents the bond from functioning. This initially appeared to be a problem with the bonding driver. After tracing the calls NetworkManager makes to the bonding driver, I discovered the root of the problem is in configure-ovs.sh.
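The symptom described here (both slaves reporting the same MAC) can be checked directly on a node. The following is a minimal sketch, assuming the current MAC of each interface is read from the sysfs attribute /sys/class/net/<iface>/address. The function names and the SYSFS_NET override are hypothetical and are not part of configure-ovs.sh.

```shell
#!/bin/bash
# Hypothetical diagnostic, not part of configure-ovs.sh.
# SYSFS_NET is an illustrative override so the check can be tried
# against a fake directory tree instead of the real sysfs.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Prints every MAC address that is shared by more than one of the
# interfaces passed as arguments (one MAC per line, nothing if unique).
duplicate_macs() {
  local iface
  for iface in "$@"; do
    cat "$SYSFS_NET/$iface/address"
  done | sort | uniq -d
}

# Returns 0 (true) when at least two of the given interfaces share a MAC.
slaves_share_mac() {
  [ -n "$(duplicate_macs "$@")" ]
}
```

For the bond in this report, the check would be invoked as: slaves_share_mac enP32807p1s0 enP49154p1s0 && echo "slaves share a MAC"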

The function activate_nm_connections() attempts to activate all of its generated profiles that are not currently in the "activated" state. In my case the following profiles are activated one at a time in this order:
br-ex, ovs-if-phys0, enP32807p1s0-slave-ovs-clone, enP49154p1s0-slave-ovs-clone, ovs-if-br-ex

However, the generated profiles have autoconnect-slaves set. When br-ex is activated, the state of ovs-if-phys0, enP32807p1s0-slave-ovs-clone and enP49154p1s0-slave-ovs-clone therefore changes to "activating". Because we only check for the "activated" state, these profiles may be activated again. As the list is walked, the state of some profiles automatically goes from "activating" to "activated"; those interfaces are not activated a second time. This leaves the bond in an unpredictable state. I am able to see in the bonding traces why both slave interfaces have the same MAC.

My fix is to check for either the "activating" or the "activated" state.

--- configure-ovs.sh 2024-09-20 15:29:03.160536239 -0700
+++ configure-ovs.sh.patched 2024-09-20 15:33:38.040336032 -0700
@@ -575,8 +575,8 @@
 # But set the entry in master_interfaces to true if this is a slave
 # Also set autoconnect to yes
 local active_state=$(nmcli -g GENERAL.STATE conn show "$conn")
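The state check described in this report can be sketched as follows. This is an illustration under stated assumptions, not the code merged in the pull request: needs_activation, get_conn_state and activate_pending are hypothetical names, and the loop only prints its decision instead of bringing connections up. The nmcli query is the one quoted in the report; the sketch assumes it prints "activating" or "activated" for profiles that are in progress or up, and something else (for example an empty string) otherwise.

```shell
#!/bin/bash
# Hypothetical helper (not from configure-ovs.sh): decides whether a
# profile still needs to be brought up, given its GENERAL.STATE value.
needs_activation() {
  local state="$1"
  case "$state" in
    # Already up, or already being brought up (for example because the
    # master was activated with autoconnect-slaves set): leave it alone.
    activated|activating) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical wrapper around the query used in the report, so the
# state source can be replaced when experimenting.
get_conn_state() {
  nmcli -g GENERAL.STATE conn show "$1"
}

# Dry run: walks the profile list in order and prints what would happen.
activate_pending() {
  local conn
  for conn in "$@"; do
    if needs_activation "$(get_conn_state "$conn")"; then
      echo "activate $conn"
    else
      echo "skip $conn"
    fi
  done
}
```

Treating "activating" the same as "activated" means a profile that NetworkManager is already bringing up is never activated a second time while the list is walked.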

Version-Release number of selected component (if applicable): First seen in 4.14 OVN-Kube

How reproducible: Reproducible with a specific OVN-Kube configuration in which network bonding is set to fail_over_mac=follow. This is the ideal setting for the SR-IOV/network setup at the customer site, where they rely on high availability.

Steps to Reproduce:
1. Set up the interfaces as described.

Actual results: Failed Bonding

Expected results: No flapping, and failover works

Additional info:
https://github.com/openshift/machine-config-operator/issues/4605
https://github.com/openshift/machine-config-operator/pull/4609
#rhel-netorking-subsystem https://redhat-internal.slack.com/archives/C04NN96F1S4/p1719943109040989

is related to

OCPBUGS-46069 [OVN] handle case when br-ex MAC != bond MAC

links to

openshift/machine-config-operator#4609: OCPBUGS-44185: Race in configure-ovs.sh affects bonding interface configuration.

RHBA-2025:3066 OpenShift Container Platform 4.18.z bug fix update

Assignee: Team MCO

Reporter: Paul Bastide

QA Contact: Sergio Regidor de la Rosa

Contributing Groups: IBM Confidential Group, Partner Engineer

Contributors: Mick Tarsel

Votes: 0

Watchers: 10

Created: 2024/11/04 2:03 PM

Updated: 2025/03/25 6:59 AM

Resolved: 2025/03/25 6:59 AM
