Bug
Resolution: Done
Critical
None
4.12
Important
None
CNF Network Sprint 230
1
Rejected
False
Description of problem:
OCP 4.12 deployments that use a secondary bridge (br-ex1) for CNI fail to start the ovs-configuration service, with multiple failures.
Version-Release number of selected component (if applicable):
OpenShift 4.12.0-rc.0 (2022-11-10)
How reproducible:
So far, at least one of the four worker nodes has always failed. It is not always the same node, and sometimes several nodes fail.
Steps to Reproduce:
1. Prepare the IPI installation on the provisioning node - RHEL 8 (haproxy, named, mirror registry, rhcos_cache_server, ...)
2. Configure the install-config.yaml (attached)
   - provisioningNetwork: enabled
   - machine network: single-stack IPv4
   - disconnected installation
   - ovn-kubernetes with hybrid-networking setup
   - LACP bonding set up using MC manifests at day 1
     * bond0 -> baremetal 192.168.32.0/24 (br-ex)
     * bond0.662 -> interface for the secondary bridge (br-ex1), 192.168.66.128/26
   - secondary bridge defined in /etc/ovnk/extra_bridge using an MC manifest (see the sketch after this list)
3. Deploy the cluster
   - The deployment usually completes
   - Nodes show Ready status, but on some nodes ovs-configuration fails
   - Subsequent MC changes fail because the MCP cannot roll out configurations to nodes with the failure
NOTE: This impacts testing for our partners Verizon and F5, because we are validating their CNFs before the OCP 4.12 release and we need a secondary bridge for CNI.
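For reference, a minimal sketch of the kind of day-1 MachineConfig used in step 2 to declare the extra bridge interface. The manifest name, Ignition version, and file mode here are illustrative assumptions; the actual attached manifests may differ:

$ cat <<'EOF' > extra-bridge-worker.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 11-worker-extra-bridge
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      # configure-ovs.sh reads this file to pick the interface for br-ex1
      - path: /etc/ovnk/extra_bridge
        mode: 420
        overwrite: true
        contents:
          source: data:text/plain;charset=utf-8,bond0.662
EOF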
Actual results:
br-ex1 and all of its related ovs-ports and interfaces fail to activate, and the ovs-configuration service fails.
Expected results:
br-ex1 and all of its related ovs-ports and interfaces activate, and the ovs-configuration service starts successfully.
Additional info:
1. Nodes and MCP info
$ oc get nodes
NAME       STATUS   ROLES                  AGE     VERSION
master-0   Ready    control-plane,master   7h59m   v1.25.2+f33d98e
master-1   Ready    control-plane,master   7h59m   v1.25.2+f33d98e
master-2   Ready    control-plane,master   8h      v1.25.2+f33d98e
worker-0   Ready    worker                 7h26m   v1.25.2+f33d98e
worker-1   Ready    worker                 7h25m   v1.25.2+f33d98e
worker-2   Ready    worker                 7h25m   v1.25.2+f33d98e
worker-3   Ready    worker                 7h25m   v1.25.2+f33d98e

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-210a69a0b40162b2f349ea3a5b5819e5   True      False      False      3              3                   3                     0                      7h57m
worker   rendered-worker-e8a62c86ce16e98e45e3166847484cf0   False     True       True       4              2                   2                     1                      7h57m
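Generic diagnostics like the following (not taken from the attached logs) can show which worker the MCP is stuck on:

$ oc describe mcp worker | grep -i -A 3 degraded
$ oc describe node worker-0 | grep machineconfiguration.openshift.io/state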
2. When logging in to the nodes via SSH, we can see that ovs-configuration has failed, and the ovs-configuration service logs show the following error (full log attached: worker-0-ovs-configuration.log):
$ ssh core@worker-0
---
Last login: Sat Nov 12 21:33:58 2022 from 192.168.62.10
[systemd]
Failed Units: 3
  NetworkManager-wait-online.service
  ovs-configuration.service
  stalld.service

[core@worker-0 ~]$ sudo journalctl -u ovs-configuration | less
...
Nov 12 15:27:54 worker-0 configure-ovs.sh[8237]: Error: invalid field 'connection.type'; allowed fields: NAME,UUID,TYPE,TIMESTAMP,TIMESTAMP-REAL,AUTOCONNECT,AUTOCONNECT-PRIORITY,READONLY,DBUS-PATH,ACT>
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' == vlan ']'
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: /usr/local/bin/configure-ovs.sh: line 178: [: ==: unary operator expected
Nov 12 15:27:54 worker-0 configure-ovs.sh[8241]: ++ nmcli --get-values connection.type conn show
Nov 12 15:27:54 worker-0 configure-ovs.sh[8241]: Error: invalid field 'connection.type'; allowed fields: NAME,UUID,TYPE,TIMESTAMP,TIMESTAMP-REAL,AUTOCONNECT,AUTOCONNECT-PRIORITY,READONLY,DBUS-PATH,ACT>
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' == bond ']'
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: /usr/local/bin/configure-ovs.sh: line 191: [: ==: unary operator expected
Nov 12 15:27:54 worker-0 configure-ovs.sh[8245]: ++ nmcli --get-values connection.type conn show
Nov 12 15:27:54 worker-0 configure-ovs.sh[8245]: Error: invalid field 'connection.type'; allowed fields: NAME,UUID,TYPE,TIMESTAMP,TIMESTAMP-REAL,AUTOCONNECT,AUTOCONNECT-PRIORITY,READONLY,DBUS-PATH,ACT>
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' == team ']'
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: /usr/local/bin/configure-ovs.sh: line 203: [: ==: unary operator expected
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + iface_type=802-3-ethernet
Nov 12 15:27:54 worker-0 configure-ovs.sh[5576]: + '[' '!' '' = 0 ']'
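The trace is consistent with an unquoted connection variable expanding to empty inside configure-ovs.sh: nmcli is invoked without a connection name, and the subsequent '[' tests collapse to '[ == vlan ]'. A minimal shell reproduction of that failure mode (the variable name below is hypothetical, for illustration only):

$ name=""
$ nmcli --get-values connection.type conn show $name
Error: invalid field 'connection.type'; allowed fields: NAME,UUID,TYPE,...
$ iface_type=""
$ [ $iface_type == vlan ]
-bash: [: ==: unary operator expected
$ [ "$iface_type" == vlan ]   # quoting the expansion avoids the bash error, though the nmcli lookup above still failed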
3. We observe that on the failed node (worker-0) the ovs-if-phys1 connection has type ethernet, while on a working node (worker-1) the same connection has type vlan, with the VLAN details present:
[core@worker-0 ~]$ sudo cat /etc/NetworkManager/system-connections/ovs-if-phys1.nmconnection
[connection]
id=ovs-if-phys1
uuid=aea14dc9-2d0c-4320-9c13-ddf3e64747bf
type=ethernet
autoconnect=false
autoconnect-priority=100
autoconnect-slaves=1
interface-name=bond0.662
master=e61c56f7-f3ba-40f7-a1c1-37921fc6c815
slave-type=ovs-port

[ethernet]
cloned-mac-address=B8:83:03:91:C5:2C
mtu=1500

[ovs-interface]
type=system

[core@worker-1 ~]$ sudo cat /etc/NetworkManager/system-connections/ovs-if-phys1.nmconnection
[connection]
id=ovs-if-phys1
uuid=9a019885-3cc1-4961-9dfa-6b7f996556c4
type=vlan
autoconnect-priority=100
autoconnect-slaves=1
interface-name=bond0.662
master=877acf53-87d7-4cdf-a078-000af4f962c3
slave-type=ovs-port
timestamp=1668265640

[ethernet]
cloned-mac-address=B8:83:03:91:C5:E8
mtu=9000

[ovs-interface]
type=system

[vlan]
flags=1
id=662
parent=bond0
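The type difference can also be checked at runtime with nmcli; a generic diagnostic, with the output we would expect given the files above:

[core@worker-0 ~]$ nmcli -g connection.type connection show ovs-if-phys1
802-3-ethernet
[core@worker-1 ~]$ nmcli -g connection.type connection show ovs-if-phys1
vlan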
4. Another problem we observe is that we explicitly disable IPv6 in the bond0.662 connection, but the generated connection for br-ex1 has the IPv6 method set to auto, when it should be disabled:
[core@worker-0 ~]$ sudo cat /etc/NetworkManager/system-connections/bond0.662.nmconnection
[connection]
id=bond0.662
type=vlan
interface-name=bond0.662
autoconnect=true
autoconnect-priority=99

[vlan]
parent=bond0
id=662

[ethernet]
mtu=9000

[ipv4]
method=auto
dhcp-timeout=2147483647
never-default=true

[ipv6]
method=disabled
never-default=true

[core@worker-0 ~]$ sudo cat /etc/NetworkManager/system-connections/br-ex1.nmconnection
[connection]
id=br-ex1
uuid=df67dcd9-4263-4707-9abc-eda16e75ea0d
type=ovs-bridge
autoconnect=false
autoconnect-slaves=1
interface-name=br-ex1

[ethernet]
mtu=1500

[ovs-bridge]

[ipv4]
method=auto

[ipv6]
addr-gen-mode=stable-privacy
method=auto

[proxy]
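For comparison, the setting we expect configure-ovs to propagate could be applied by hand with nmcli (a workaround sketch only; it does not address the root cause in the script):

[core@worker-0 ~]$ sudo nmcli connection modify br-ex1 ipv6.method disabled
[core@worker-0 ~]$ sudo nmcli connection up br-ex1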
5. All journals, the must-gather, and some deployment files can be found in our CI console (login with Red Hat SSO):
https://www.distributed-ci.io/jobs/46459571-900f-43df-8798-d36b322d26f4/files
Some of the logs are also attached here to facilitate the task: the worker-0 files are from the node with the OVS issue, and the worker-1 files are from a healthy worker, in case you want to compare.
11_master-bonding.yaml
11_worker-bonding.yaml
install-config.yaml
journal-worker-0.log
journal-worker-1.log
must_gather.tar.gz
sosreport-worker-0-2022-11-12-csbyqfe.tar.xz
sosreport-worker-1-2022-11-12-ubltjdn.tar.xz
worker-0-ip-nmcli-info.log
worker-0-ovs-configuration.log
worker-1-ip-nmcli-info.log
worker-1-ovs-configuration.log
Please let us know if you need any additional information.
blocks: OCPBUGS-6973 [IPI] Baremetal ovs-configure.sh script fails to start secondary bridge br-ex1 (Closed)
is cloned by: OCPBUGS-6973 [IPI] Baremetal ovs-configure.sh script fails to start secondary bridge br-ex1 (Closed)
links to: