- Bug
- Resolution: Not a Bug
- Major
- None
- 4.10.z
- Important
- None
- Proposed
- False
Description of problem:
Connections to the cluster are lost almost every time an out-of-tree driver is installed. The state persists until the host server is manually rebooted.
I will share the link to the sosreport collected while the issue was present. The following observations were made (a sketch of commands to re-check each service from the node console follows the list):
1. Kubelet service is inactive.
* kubelet.service - Kubernetes Kubelet
     Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             `-10-mco-default-madv.conf, 20-logging.conf, 20-nodenet.conf
     Active: inactive (dead)
2. NetworkManager-wait-online.service is in the failed state.
* NetworkManager-wait-online.service - Network Manager Wait Online
     Loaded: loaded (/usr/lib/systemd/system/NetworkManager-wait-online.service; enabled; vendor preset: disabled)
     Active: failed (Result: exit-code) since Mon 2022-12-12 14:47:17 UTC; 33min ago
       Docs: man:nm-online(1)
    Process: 3111 ExecStart=/usr/bin/nm-online -s -q (code=exited, status=1/FAILURE)
   Main PID: 3111 (code=exited, status=1/FAILURE)
        CPU: 347ms

Dec 12 14:46:17 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net systemd[1]: Starting Network Manager Wait Online...
Dec 12 14:47:17 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Dec 12 14:47:17 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
Dec 12 14:47:17 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net systemd[1]: Failed to start Network Manager Wait Online.
Dec 12 14:47:17 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net systemd[1]: NetworkManager-wait-online.service: Consumed 347ms CPU time
3. ovs-configuration.service is in the failed state.
* ovs-configuration.service - Configures OVS with proper host networking configuration
     Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
     Active: failed (Result: exit-code) since Mon 2022-12-12 14:49:41 UTC; 30min ago
    Process: 3632 ExecStart=/usr/local/bin/configure-ovs.sh OVNKubernetes (code=exited, status=1/FAILURE)
   Main PID: 3632 (code=exited, status=1/FAILURE)
        CPU: 5.391s

Dec 12 14:49:41 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net configure-ovs.sh[3632]: link/ether b4:96:91:c0:85:53 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 9702 numtxqueues 64 numrxqueues 64 gso_max_size 65536 gso_max_segs 65535
Dec 12 14:49:41 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net systemd[1]: ovs-configuration.service: Failed with result 'exit-code'.
Dec 12 14:49:42 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net configure-ovs.sh[3632]: + ip route show
Dec 12 14:49:41 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net systemd[1]: Failed to start Configures OVS with proper host networking configuration.
Dec 12 14:49:42 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net configure-ovs.sh[3632]: 10.88.0.0/16 dev cni-podman0 proto kernel scope link src 10.88.0.1 linkdown
Dec 12 14:49:41 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net systemd[1]: ovs-configuration.service: Consumed 5.391s CPU time
Dec 12 14:49:42 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net configure-ovs.sh[3632]: + ip -6 route show
Dec 12 14:49:42 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net configure-ovs.sh[3632]: ::1 dev lo proto kernel metric 256 pref medium
Dec 12 14:49:42 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net configure-ovs.sh[3632]: fe80::/64 dev cni-podman0 proto kernel metric 256 linkdown pref medium
Dec 12 14:49:42 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net configure-ovs.sh[3632]: + exit 1
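For reference, the following is a minimal sketch of commands that can be run from the node console (or BMC) to confirm the three observations above once API/SSH access to the cluster is gone. The individual commands are standard systemd/NetworkManager/OVS tooling; the specific sequence is illustrative and is not taken from the sosreport.

  # 1. Confirm kubelet is inactive and look at its last log lines
  systemctl status kubelet
  journalctl -u kubelet -b --no-pager | tail -n 50

  # 2. NetworkManager-wait-online runs 'nm-online -s -q'; re-running it shows
  #    whether startup connections ever complete (non-zero exit = still failing)
  systemctl status NetworkManager-wait-online
  /usr/bin/nm-online -s -q; echo "nm-online exit code: $?"
  nmcli device status
  nmcli connection show

  # 3. ovs-configuration wraps /usr/local/bin/configure-ovs.sh OVNKubernetes;
  #    its journal contains the shell trace that ends in '+ exit 1' shown above
  systemctl status ovs-configuration
  journalctl -u ovs-configuration -b --no-pager | tail -n 100
  ovs-vsctl show
  ip addr show br-ex 2>/dev/null || echo "br-ex not present"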
Version-Release number of selected component (if applicable):
4.10.32
How reproducible:
Always
Steps to Reproduce:
1. Apply the MC that loads the drivers (the MC is available in the must-gather); a minimal apply-and-watch sketch follows these steps.
2. During the MCP-initiated reboot, the connection to the node is lost.
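A short sketch of the reproduction flow, assuming cluster-admin access; the file name below is a placeholder for the actual MC attached in the must-gather:

  # Placeholder file name; the real MC is the one included in the must-gather
  oc apply -f out-of-tree-driver-mc.yaml

  # Watch the MachineConfigPool roll out; the node reboots as part of this
  oc get mcp -w
  oc get nodes -w

  # After the MCP-triggered reboot the node never comes back Ready and
  # SSH/API connections to it time out
  oc get nodes
  ssh core@<node-ip>    # <node-ip> is a placeholder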
Actual results:
Connection to the node is lost after the MCP reboot.
Expected results:
The connection to the node should not be lost after the MCP reboot.
Additional info:
Workaround: Manually rebooting the node restores the connection.
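A hedged sketch of how recovery can be verified after the manual reboot (assuming console/SSH access is back); the exact checks are ours, not from the sosreport:

  # After the manual reboot, the previously failed units come up cleanly
  systemctl is-active NetworkManager-wait-online ovs-configuration kubelet

  # And the node rejoins the cluster
  oc get nodes
  oc get co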