OCPBUGS-44185: Network bonding configuration not working with fail_over_mac=follow

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None
    • 4.14.z, 4.15.z, 4.17.z, 4.16.z, 4.18.0
    • None
    • None
    • Proposed
    • False

      Description of problem:
      Bonded network configurations with mode=active-backup and fail_over_mac=follow do not function because of a race in /var/usrlocal/bin/configure-ovs.sh.

      This race condition results in flapping.

      The customer encountered the issue in July and worked with the IBM LTC Power team to trace it from the Linux kernel through OVN-Kube and into the MCO configuration. The customer details can be shared in Slack.

      The corresponding BZ https://bugzilla.linux.ibm.com/show_bug.cgi?id=210291 could not be mirrored.

      The GitHub issue is https://github.com/openshift/machine-config-operator/issues/4605
      The fix is in https://github.com/openshift/machine-config-operator/pull/4609

      From Dave Wilder... the interfaces are set up as described in the issue...

      At this point the MACs of the bond's slaves (enP32807p1s0, enP49154p1s0) are the same. The purpose of fail_over_mac=follow is to ensure the MACs are never the same, so this prevents the bond from functioning. This initially appeared to be a problem with the bonding driver, but after tracing the calls NetworkManager makes to the bonding driver, I discovered that the root of the problem is in configure-ovs.sh.
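The symptom above (both slaves reporting the same MAC) can be checked with a small helper. This is a minimal sketch, not part of configure-ovs.sh: the function names macs_are_distinct and iface_mac are hypothetical, and the only system detail assumed is the standard Linux sysfs path /sys/class/net/<interface>/address.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not taken from configure-ovs.sh.

# Return 0 if every MAC address passed as an argument is unique, 1 otherwise.
# The comparison is case-insensitive because MACs may be printed in either case.
macs_are_distinct() {
  local dupes
  dupes=$(printf '%s\n' "$@" | tr 'A-F' 'a-f' | sort | uniq -d)
  [ -z "$dupes" ]
}

# Read the current MAC address of an interface from sysfs.
iface_mac() {
  cat "/sys/class/net/$1/address"
}
```

On an affected node, something like `macs_are_distinct "$(iface_mac enP32807p1s0)" "$(iface_mac enP49154p1s0)" || echo "slaves share a MAC"` would flag the condition described here.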

      The function activate_nm_connections() attempts to activate all of its generated profiles that are not currently in the "activated" state. In my case the following profiles are activated one at a time, in this order:
      br-ex, ovs-if-phys0, enP32807p1s0-slave-ovs-clone, enP49154p1s0-slave-ovs-clone, ovs-if-br-ex

      However, the generated profiles have autoconnect-slaves set. When br-ex is activated, the state of ovs-if-phys0, enP32807p1s0-slave-ovs-clone and enP49154p1s0-slave-ovs-clone therefore changes to "activating". Because we only check for the "activated" state, these profiles may be activated again. As the list is walked, the state of some profiles automatically goes from "activating" to "activated"; those interfaces are not activated a second time. The result is that the bond is left in an unpredictable state. I can see in the bonding traces why both slave interfaces have the same MAC.
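To watch the state transitions described above, the profiles can be polled with the same nmcli query that configure-ovs.sh uses. This is a rough diagnostic sketch: show_profile_states is a hypothetical name, and the "inactive" label is only a placeholder printed when nmcli returns no GENERAL.STATE value, which is assumed here to mean the profile is not active.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic helper; not taken from configure-ovs.sh.

# Print "<profile>: <state>" for each NetworkManager profile name given.
# An empty GENERAL.STATE is shown as "inactive" (placeholder label).
show_profile_states() {
  local conn state
  for conn in "$@"; do
    state=$(nmcli -g GENERAL.STATE conn show "$conn" 2>/dev/null)
    printf '%s: %s\n' "$conn" "${state:-inactive}"
  done
}
```

Running `show_profile_states br-ex ovs-if-phys0 enP32807p1s0-slave-ovs-clone enP49154p1s0-slave-ovs-clone ovs-if-br-ex` right after br-ex is brought up should show which slave profiles are already "activating" because of autoconnect-slaves.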

      My fix is to check for either the "activating" or the "activated" state.

      --- configure-ovs.sh 2024-09-20 15:29:03.160536239 -0700
      +++ configure-ovs.sh.patched 2024-09-20 15:33:38.040336032 -0700
      @@ -575,8 +575,8 @@
         # But set the entry in master_interfaces to true if this is a slave
         # Also set autoconnect to yes
         local active_state=$(nmcli -g GENERAL.STATE conn show "$conn")
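The check described in the fix (skip profiles that are either "activating" or "activated") might look roughly like the following. This is only a sketch of the idea and not the text of the actual patch: needs_activation is a hypothetical helper, and the commented usage lines assume the loop variable conn seen in the hunk above.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating the state check; not the merged patch.

# Return 0 (true) when a profile in the given state still needs to be brought up.
# Profiles that are already "activated", or that NetworkManager is bringing up
# on its own ("activating"), are left alone.
needs_activation() {
  case "$1" in
    activated|activating) return 1 ;;
    *) return 0 ;;
  esac
}

# Possible use inside the activation loop (assumed shape):
#   active_state=$(nmcli -g GENERAL.STATE conn show "$conn")
#   if needs_activation "$active_state"; then
#     nmcli conn up "$conn"
#   fi
```

Treating "activating" the same as "activated" avoids issuing a second activation for slave profiles that autoconnect-slaves has already started bringing up.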

      Version-Release number of selected component (if applicable): First seen in 4.14 OVN-Kube

      How reproducible: Reproducible with a specific OVN-Kube configuration in which network bonding is set to fail_over_mac=follow. This is the ideal setting for the SR-IOV/Network setup at the customer site, which relies on high availability.

      Steps to Reproduce:
      1. Set up the interfaces as described.

      Actual results: Failed Bonding

      Expected results: No flapping, and failover works

      Additional info:
      https://github.com/openshift/machine-config-operator/issues/4605
      https://github.com/openshift/machine-config-operator/pull/4609
      #rhel-networking-subsystem https://redhat-internal.slack.com/archives/C04NN96F1S4/p1719943109040989

              team-mco Team MCO
              pbastide_rh Paul Bastide
              Sergio Regidor de la Rosa
              IBM Confidential Group, Partner Engineer
              Mick Tarsel