-
Bug
-
Resolution: Done
-
Undefined
-
None
-
4.11
-
None
-
None
-
False
-
+++ This bug was initially created as a clone of Bug #2103899 +++
+++ This bug was initially created as a clone of Bug #2099945 +++
Description of problem:
vSphere UPI static IP active-backup bonding using kargs
NetworkManager enters an infinite link flap loop after active-backup primary slave link is restored.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. disconnect link of primary slave (ens192) using vSphere console
2. wait for link to fail over to backup slave (ens224).
3. reboot
4. re-connect old primary slave (ens192).
Actual results:
NetworkManager enters an infinite loop of link flaps. Network connectivity to the node is lost.
Once old primary slave (ens192) is re-disconnected NetworkManager link flap stops and network recovers.
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2206] device (bond0): assigned bond port ens192
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2207] device (ens192): Activation: connection 'ens192' enslaved, continuing activation
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2213] policy: auto-activating connection 'ens224-slave-ovs-clone' (1d428f7f-4ff6-42fd-ba2c-6831bd40544d)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2214] policy: auto-activating connection 'ens256-slave-ovs-clone' (2f26484b-4855-4a68-b7cc-407f27d53546)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2267] device (ens192): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2272] device (ens224): Activation: starting connection 'ens224-slave-ovs-clone' (1d428f7f-4ff6-42fd-ba2c-6831bd40544d)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2273] device (ens256): Activation: starting connection 'ens256-slave-ovs-clone' (2f26484b-4855-4a68-b7cc-407f27d53546)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2275] device (bond0): disconnecting for new activation request.
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2276] device (bond0): state change: ip-config -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2282] device (ens224): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2284] device (ens256): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2309] device (ens192): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2311] device (bond0): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.2596] device (bond0): released bond slave ens192
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <warn> [1655849684.2676] device (ens192): queue-state[activated] reason:none, id:452056]: replace previously queued state change
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3073] device (bond0): Activation: starting connection 'ovs-if-phys0' (7ac68a21-25ac-47c4-a8ca-70a83181965d)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3095] device (ens192): state change: secondaries -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3099] device (bond0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3243] device (bond0): set-hw-addr: set-cloned MAC address to 00:50:56:AC:59:95 (00:50:56:AC:59:95)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3245] device (bond0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3294] device (ens224): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3297] device (ens256): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3300] device (bond0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3304] device (bond0): Activation: connection 'ovs-if-phys0' enslaved, continuing activation
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3306] device (ens192): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3338] device (ens224): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3639] device (bond0): assigned bond port ens224
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3640] device (ens224): Activation: connection 'ens224-slave-ovs-clone' enslaved, continuing activation
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3642] device (ens256): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3903] device (bond0): assigned bond port ens256
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3903] device (ens256): Activation: connection 'ens256-slave-ovs-clone' enslaved, continuing activation
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3906] device (bond0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3910] policy: auto-activating connection 'ens192' (2377489e-02ef-45db-bac0-06585c6c4fff)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3912] device (bond0): carrier: link connected
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3931] device (ens224): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3934] device (ens256): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3939] device (ens192): Activation: starting connection 'ens192' (2377489e-02ef-45db-bac0-06585c6c4fff)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3940] device (bond0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3943] device (bond0): disconnecting for new activation request.
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3943] device (bond0): state change: secondaries -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3944] device (bond0): releasing ovs interface bond0
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3946] device (bond0): released from master device bond0
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3948] device (ens192): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3953] device (ens224): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3954] device (ens224): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3957] device (ens224): Activation: successful, device activated.
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3963] device (ens256): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3965] device (ens256): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.3967] device (ens256): Activation: successful, device activated.
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4014] device (bond0): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4349] device (bond0): released bond slave ens224
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4779] device (bond0): released bond slave ens256
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4781] device (bond0): set-hw-addr: set MAC address to 00:50:56:AC:59:95 (restore)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4863] device (bond0): set-hw-addr: reset MAC address to D2:09:18:BB:91:74 (deactivate)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4866] device (bond0): Activation: starting connection 'bond0' (ef21706f-9968-419c-877d-f3c80a098daf)
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4925] device (ens224): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4930] device (ens256): state change: activated -> deactivating (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.4935] device (bond0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5079] device (bond0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5097] device (ens192): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5099] device (bond0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5101] device (ens224): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5109] device (ens256): state change: deactivating -> disconnected (reason 'new-activation', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5120] device (ens192): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jun 21 22:14:44 compute-1 NetworkManager[1408]: <info> [1655849684.5380] device (bond0): assigned bond port ens192
Expected results:
No link flaps, no network disconnction.
Additional info:
Unable to reproduce yet on IPI libvirt baremetal.
— Additional comment from rbrattai@redhat.com on 2022-06-22 04:47:51 UTC —
Logs from compute-1
http://file.rdu.redhat.com/~rbrattai/logs/bz-2099945-compute-1.tar.xz
— Additional comment from rbrattai@redhat.com on 2022-06-22 05:05:13 UTC —
Does not reproduce on RHEL8.6 worker with
{ens192,ens224,bond0}.nmconnection
— Additional comment from rbrattai@redhat.com on 2022-06-22 05:18:13 UTC —
Logs with two interfaces instead of three.
http://file.rdu.redhat.com/~rbrattai/logs/bz-2099945-compute-0.tar.xz
— Additional comment from jcaamano@redhat.com on 2022-06-22 08:21:00 UTC —
From configure-ovs perspective, this might be due to the fact that only slave active profiles are cloned. Since ens192 was down on reboot, it had no active profile, so no profile was cloned. Then when it is reconnected, the original profile activates, which then activates the original bond profile as well instead of the clone made for ovn-k, and so on...
Not sure though about the loop or if it is worth looking at it since now things are not working as we expect anyway.
This should work if there were no reboot.
I guess that we ought to clone all slave profiles, active or not. We could filter instead by autoconnect being set.
I don't consider this a regression. We probably introduced this issue when we started cloning the profiles, a while ago, and that was to solve another set of bonding issues that might had this scenario not working either.
— Additional comment from jcaamano@redhat.com on 2022-06-22 10:56:26 UTC —
@rbrattai@redhat.com
Ross, as time allows, please try out this tentative improvement: https://github.com/openshift/machine-config-operator/pull/3203
— Additional comment from aos-team-art-private@redhat.com on 2022-07-05 01:55:51 UTC —
Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.12 release.
— Additional comment from rbrattai@redhat.com on 2022-07-26 18:50:58 UTC —
Tested on 4.11.0-0.ci.test-2022-07-26-135031-ci-ln-yzxxi7k-latest
vSphere UPI RHCOS static-ip active-backup fail_over_mac=0
cloned the link down slave.
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + for conn_uuid in $(nmcli -g UUID connection show)
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2356]: ++ nmcli -g connection.master connection show uuid 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + '[' 0aadf316-4058-441a-a11a-9ae7aa1806dd '!=' 0aadf316-4058-441a-a11a-9ae7aa1806dd ']'
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2360]: ++ nmcli -g GENERAL.STATE connection show 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + local active_state=
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2364]: ++ nmcli -g connection.autoconnect connection show 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + local autoconnect=yes
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + '[' '' '!=' activated ']'
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + '[' yes '!=' yes ']'
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + local new_uuid
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ clone_slave_connection 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ local uuid=6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ local old_name
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2369]: +++ nmcli -g connection.id connection show uuid 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ old_name=ens192
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ local new_name=ens192-slave-ovs-clone
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ nmcli connection show id ens192-slave-ovs-clone
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ nmcli connection clone 6bf974b9-a00d-4d7b-9c8c-5f04a547f1c2 ens192-slave-ovs-clone
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[2368]: ++ nmcli -g connection.uuid connection show ens192-slave-ovs-clone
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + new_uuid=d0d12ff7-6a29-4ab1-bef1-1c52404129f1
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + nmcli conn mod uuid d0d12ff7-6a29-4ab1-bef1-1c52404129f1 connection.master cc20e638-a6c7-4ad4-b1fe-bda4599c3775
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + nmcli conn mod d0d12ff7-6a29-4ab1-bef1-1c52404129f1 connection.autoconnect-priority 100
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + nmcli conn mod d0d12ff7-6a29-4ab1-bef1-1c52404129f1 connection.autoconnect no
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + echo 'Replaced master 0aadf316-4058-441a-a11a-9ae7aa1806dd with cc20e638-a6c7-4ad4-b1fe-bda4599c3775 for slave profile d0d12ff7-6a29-4ab1-bef1-1c52404129f1'
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: Replaced master 0aadf316-4058-441a-a11a-9ae7aa1806dd with cc20e638-a6c7-4ad4-b1fe-bda4599c3775 for slave profile d0d12ff7-6a29-4ab1-bef1-1c52404129f1
Jul 26 16:50:54 62pzx-compute-0 configure-ovs.sh[1847]: + for conn_uuid in $(nmcli -g UUID connection show)
— Additional comment from aos-team-art-private@bot.bugzilla.redhat.com on 2022-08-11 21:05:07 UTC —
Elliott changed bug status from MODIFIED to ON_QA.
This bug is expected to ship in the next 4.11 release.
— Additional comment from rbrattai@redhat.com on 2022-08-15 17:27:18 UTC —
Fix in 4.11.0-0.nightly-2022-08-15-074436
- clones
-
OCPBUGS-287 [OVN] bonding fails after active-backup fail-over and reboot, kargs static IP
- Closed
- is blocked by
-
OCPBUGS-287 [OVN] bonding fails after active-backup fail-over and reboot, kargs static IP
- Closed
- links to