Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-288

[OVN] bonding fails after active-backup fail-over and reboot, kargs static IP

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Undefined
    • Resolution: Done
    • 4.11
    • None
    • None

    Description

      +++ This bug was initially created as a clone of Bug #2103899 +++

      +++ This bug was initially created as a clone of Bug #2099945 +++

      Description of problem:

      vSphere UPI static IP active-backup bonding using kargs

      NetworkManager enters an infinite link flap loop after active-backup primary slave link is restored.

      Version-Release number of selected component (if applicable):

      How reproducible:

      Steps to Reproduce:
      1. disconnect link of primary slave (ens192) using vSphere console
      2. wait for link to fail over to backup slave (ens224).
      3. reboot
      4. re-connect old primary slave (ens192).

      Actual results:

      NetworkManager enters an infinite loop of link flaps. Network connectivity to the node is lost.

      Once old primary slave (ens192) is re-disconnected NetworkManager link flap stops and network recovers.

      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2206] device (bond0): assigned bond port ens192
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2207] device (ens192): Activation: connection 'ens192' enslaved, continuing activation
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2213] policy: auto-activating connection 'ens224-slave-ovs-clone' (1d428f7f-4ff6-42fd-ba2c-6831bd40544d)
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2214] policy: auto-activating connection 'ens256-slave-ovs-clone' (2f26484b-4855-4a68-b7cc-407f27d53546)
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2267] device (ens192): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2272] device (ens224): Activation: starting connection 'ens224-slave-ovs-clone' (1d428f7f-4ff6-42fd-ba2c-6831bd40544d)
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2273] device (ens256): Activation: starting connection 'ens256-slave-ovs-clone' (2f26484b-4855-4a68-b7cc-407f27d53546)
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2275] device (bond0): disconnecting for new activation request.
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2276] device (bond0): state change: ip-config -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2282] device (ens224): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2284] device (ens256): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2309] device (ens192): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2311] device (bond0): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2596] device (bond0): released bond slave ens192
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <warn> [1655849684.2676] device (ens192): queue-state[activated] reason:none, id:452056]: replace previously queued state change
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3073] device (bond0): Activation: starting connection 'ovs-if-phys0' (7ac68a21-25ac-47c4-a8ca-70a83181965d)
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3095] device (ens192): state change: secondaries -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3099] device (bond0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3243] device (bond0): set-hw-addr: set-cloned MAC address to 00:50:56:AC:59:95 (00:50:56:AC:59:95)
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3245] device (bond0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3294] device (ens224): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3297] device (ens256): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3300] device (bond0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3304] device (bond0): Activation: connection 'ovs-if-phys0' enslaved, continuing activation
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3306] device (ens192): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3338] device (ens224): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3639] device (bond0): assigned bond port ens224
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3640] device (ens224): Activation: connection 'ens224-slave-ovs-clone' enslaved, continuing activation
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3642] device (ens256): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3903] device (bond0): assigned bond port ens256
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3903] device (ens256): Activation: connection 'ens256-slave-ovs-clone' enslaved, continuing activation
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3906] device (bond0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3910] policy: auto-activating connection 'ens192' (2377489e-02ef-45db-bac0-06585c6c4fff)
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3912] device (bond0): carrier: link connected
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3931] device (ens224): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3934] device (ens256): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3939] device (ens192): Activation: starting connection 'ens192' (2377489e-02ef-45db-bac0-06585c6c4fff)
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3940] device (bond0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3943] device (bond0): disconnecting for new activation request.
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3943] device (bond0): state change: secondaries -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3944] device (bond0): releasing ovs interface bond0
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3946] device (bond0): released from master device bond0
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3948] device (ens192): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3953] device (ens224): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3954] device (ens224): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3957] device (ens224): Activation: successful, device activated.
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3963] device (ens256): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3965] device (ens256): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3967] device (ens256): Activation: successful, device activated.
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4014] device (bond0): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4349] device (bond0): released bond slave ens224
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4779] device (bond0): released bond slave ens256
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4781] device (bond0): set-hw-addr: set MAC address to 00:50:56:AC:59:95 (restore)
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4863] device (bond0): set-hw-addr: reset MAC address to D2:09:18:BB:91:74 (deactivate)
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4866] device (bond0): Activation: starting connection 'bond0' (ef21706f-9968-419c-877d-f3c80a098daf)
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4925] device (ens224): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4930] device (ens256): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4935] device (bond0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5079] device (bond0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5097] device (ens192): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5099] device (bond0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5101] device (ens224): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5109] device (ens256): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5120] device (ens192): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
      Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5380] device (bond0): assigned bond port ens192

      Expected results:

      No link flaps, no network disconnction.

      Additional info:

      Unable to reproduce yet on IPI libvirt baremetal.

      — Additional comment from rbrattai@redhat.com on 2022-06-22 04:47:51 UTC —

      Logs from compute-1

      http://file.rdu.redhat.com/~rbrattai/logs/bz-2099945-compute-1.tar.xz

      — Additional comment from rbrattai@redhat.com on 2022-06-22 05:05:13 UTC —

      Does not reproduce on RHEL8.6 worker with

      {ens192,ens224,bond0}

      .nmconnection

      — Additional comment from rbrattai@redhat.com on 2022-06-22 05:18:13 UTC —

      Logs with two interfaces instead of three.

      http://file.rdu.redhat.com/~rbrattai/logs/bz-2099945-compute-0.tar.xz

      — Additional comment from jcaamano@redhat.com on 2022-06-22 08:21:00 UTC —

      From configure-ovs perspective, this might be due to the fact that only slave active profiles are cloned. Since ens192 was down on reboot, it had no active profile, so no profile was cloned. Then when it is reconnected, the original profile activates, which then activates the original bond profile as well instead of the clone made for ovn-k, and so on...

      Not sure though about the loop or if it is worth looking at it since now things are not working as we expect anyway.

      This should work if there were no reboot.

      I guess that we ought to clone all slave profiles, active or not. We could filter instead by autoconnect being set.

      I don't consider this a regression. We probably introduced this issue when we started cloning the profiles, a while ago, and that was to solve another set of bonding issues that might had this scenario not working either.

      — Additional comment from jcaamano@redhat.com on 2022-06-22 10:56:26 UTC —

      @rbrattai@redhat.com

      Ross, as time allows, please try out this tentative improvement: https://github.com/openshift/machine-config-operator/pull/3203

      — Additional comment from aos-team-art-private@redhat.com on 2022-07-05 01:55:51 UTC —

      Elliott changed bug status from MODIFIED to ON_QA.
      This bug is expected to ship in the next 4.12 release.

      — Additional comment from rbrattai@redhat.com on 2022-07-26 18:50:58 UTC —

      Tested on 4.11.0-0.ci.test-2022-07-26-135031-ci-ln-yzxxi7k-latest

      vSphere UPI RHCOS static-ip active-backup fail_over_mac=0

      cloned the link down slave.

      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + for conn_uuid in $(nmcli -g UUID connection show)
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2356]: ++ nmcli -g connection.master connection show uuid 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + '[' 0aadf316-4058-441a-a11a-9ae7aa1806dd '!=' 0aadf316-4058-441a-a11a-9ae7aa1806dd ']'
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2360]: ++ nmcli -g GENERAL.STATE connection show 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + local active_state=
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2364]: ++ nmcli -g connection.autoconnect connection show 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + local autoconnect=yes
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + '[' '' '!=' activated ']'
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + '[' yes '!=' yes ']'
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + local new_uuid
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ clone_slave_connection 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ local uuid=6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ local old_name
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2369]: +++ nmcli -g connection.id connection show uuid 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ old_name=ens192
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ local new_name=ens192-slave-ovs-clone
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ nmcli connection show id ens192-slave-ovs-clone
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ nmcli connection clone 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2 ens192-slave-ovs-clone
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ nmcli -g connection.uuid connection show ens192-slave-ovs-clone
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + new_uuid=d0d12ff7-6a29-4ab1-bef1-1c52404129f1
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + nmcli conn mod uuid d0d12ff7-6a29-4ab1-bef1-1c52404129f1 connection.master cc20e638-a6c7-4ad4-b1fe-bda4599c3775
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + nmcli conn mod d0d12ff7-6a29-4ab1-bef1-1c52404129f1 connection.autoconnect-priority 100
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + nmcli conn mod d0d12ff7-6a29-4ab1-bef1-1c52404129f1 connection.autoconnect no
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + echo 'Replaced master 0aadf316-4058-441a-a11a-9ae7aa1806dd with cc20e638-a6c7-4ad4-b1fe-bda4599c3775 for slave profile d0d12ff7-6a29-4ab1-bef1-1c52404129f1'
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: Replaced master 0aadf316-4058-441a-a11a-9ae7aa1806dd with cc20e638-a6c7-4ad4-b1fe-bda4599c3775 for slave profile d0d12ff7-6a29-4ab1-bef1-1c52404129f1
      Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + for conn_uuid in $(nmcli -g UUID connection show)

      — Additional comment from aos-team-art-private@bot.bugzilla.redhat.com on 2022-08-11 21:05:07 UTC —

      Elliott changed bug status from MODIFIED to ON_QA.
      This bug is expected to ship in the next 4.11 release.

      — Additional comment from rbrattai@redhat.com on 2022-08-15 17:27:18 UTC —

      Fix in 4.11.0-0.nightly-2022-08-15-074436

      Attachments

        Issue Links

          Activity

            People

              jcaamano@redhat.com Jaime Caamaño Ruiz
              jcaamano@redhat.com Jaime Caamaño Ruiz
              Ross Brattain Ross Brattain
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: