Bug
Resolution: Done-Errata
Normal
None
4.12
Description of problem:
The cluster was healthy, but we needed to fix a NetworkManager bug [1] that caused the hostname to flap, so the customer applied the machine config [2].
After the nodes rebooted, we noticed that 24 of the 119 NNCEs for this policy showed ConfigurationAborted:
$ oc get nodenetworkconfigurationenactments | awk '/ConfigurationAborted/ || /NAME/'
NAME                                                        STATUS    REASON
worker-076.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-077.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-079.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-081.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-082.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-083.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-084.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-085.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-086.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-087.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-088.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-089.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-090.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-091.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-092.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-093.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-094.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-097.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-100.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-101.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-102.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-103.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-104.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
worker-112.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted   ConfigurationAborted
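For reference when triaging, the full abort message for a given node can be read from the enactment's status conditions; a sketch using the worker-077 enactment name from the listing above (this exact command was not part of the original triage):

# Hypothetical triage step: dump one aborted enactment in full to read the
# abort message under .status.conditions.
$ oc get nodenetworkconfigurationenactments \
    worker-077.<redacted-clustername>.wsnmacvlanpolicy-bond1 -o yaml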
A sosreport from one of the above workers (worker-077) showed that all of the interfaces created by nmstate did in fact exist:
$ less sosreport-worker-077-2023-12-05-cluzbsy/sos_commands/networking/ip_-d_address
30: bond1.260@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535
    vlan protocol 802.1Q id 260 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
31: bond1.270@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535
    vlan protocol 802.1Q id 270 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
32: bond1.271@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535
    vlan protocol 802.1Q id 271 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
33: bond1.272@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535
    vlan protocol 802.1Q id 272 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
34: bond1.273@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535
    vlan protocol 802.1Q id 273 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
35: bond1.280@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535
    vlan protocol 802.1Q id 280 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
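On a live cluster (as opposed to the sosreport), roughly the same check could be done with oc debug; a sketch, assuming the node name follows the redacted form used above:

# Hypothetical live equivalent of the sosreport check: list the VLAN
# sub-interfaces directly on the node.
$ oc debug node/worker-077.<redacted-clustername> -- chroot /host ip -d link show type vlan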
Here are the interfaces that were expected to be configured:
$ oc get nncp
NAME                     STATUS     REASON
wsnmacvlanpolicy-bond1   Degraded   FailedToConfigure

$ oc get nncp wsnmacvlanpolicy-bond1 -o yaml
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: wsnmacvlanpolicy-bond1
spec:
  desiredState:
    interfaces:
    - name: bond1.271
      state: up
      type: vlan
      vlan:
        base-iface: bond1
        id: 271
    - name: bond1.272
      state: up
      type: vlan
      vlan:
        base-iface: bond1
        id: 272
    - name: bond1.270
      state: up
      type: vlan
      vlan:
        base-iface: bond1
        id: 270
    - name: bond1.260
      state: up
      type: vlan
      vlan:
        base-iface: bond1
        id: 260
    - name: bond1.280
      state: up
      type: vlan
      vlan:
        base-iface: bond1
        id: 280
    - name: bond1.273
      state: up
      type: vlan
      vlan:
        base-iface: bond1
        id: 273
status:
  conditions:
  - lastHeartbeatTime: "2023-11-29T22:14:42Z"
    lastTransitionTime: "2023-11-29T22:14:34Z"
    reason: FailedToConfigure
    status: "False"
    type: Available
  - lastHeartbeatTime: "2023-11-29T22:14:42Z"
    lastTransitionTime: "2023-11-29T22:14:34Z"
    message: 0/119 nodes failed to configure, 24 nodes aborted configuration
    reason: FailedToConfigure
    status: "True"
    type: Degraded
  - lastHeartbeatTime: "2023-11-29T22:14:42Z"
    lastTransitionTime: "2023-11-29T22:14:34Z"
    reason: ConfigurationProgressing
    status: "False"
    type: Progressing
  lastUnavailableNodeCountUpdate: "2023-11-17T15:06:48Z"
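As a cross-check of what nmstate itself saw on the affected node, something like the following could be run inside that node's handler pod (a sketch; nmstate-handler-xxgvk is the worker-077 pod shown further below, and the grep pattern is only illustrative):

# Hypothetical cross-check: ask nmstate for its view of the current state and
# pick out the bond1 VLAN interfaces from the policy.
$ oc exec -n openshift-nmstate nmstate-handler-xxgvk -- nmstatectl show | grep -E 'name: bond1\.(260|27[0-3]|280)'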
There was a thought that it might have been due to the IPv6 bug [3], since these messages were found in dmesg on the node around boot time:
[ 565.632488] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x20]=0x11 status 0x1
[ 565.632508] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
[ 565.632523] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
[ 565.632538] __ib_cache_gid_add: unable to add gid 0000:0000:0000:0000:0000:ffff:cc97:5e5a error=-14
[ 565.632582] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x21]=0x11 status 0x1
[ 565.632595] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
[ 565.632608] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
[ 565.632623] __ib_cache_gid_add: unable to add gid 0000:0000:0000:0000:0000:ffff:cc97:5e5a error=-14
[ 565.632712] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x22]=0x11 status 0x1
[ 565.632725] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
[ 565.632737] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
[ 565.632751] __ib_cache_gid_add: unable to add gid 2600:40d0:0000:000e:00cd:0fd0:0000:00be error=-14
[ 565.632806] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x23]=0x11 status 0x1
[ 565.632818] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
[ 565.632831] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
[ 565.632845] __ib_cache_gid_add: unable to add gid 2600:40d0:0000:000e:00cd:0fd0:0000:00be error=-14
[ 565.633758] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x24]=0x11 status 0x1
[ 565.633760] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
[ 565.633760] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
[ 565.633762] __ib_cache_gid_add: unable to add gid fe80:0000:0000:0000:e63d:1aff:fe88:3fa0 error=-14
[ 565.633798] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x25]=0x11 status 0x1
[ 565.633799] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
[ 565.633800] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
[ 565.633802] __ib_cache_gid_add: unable to add gid fe80:0000:0000:0000:e63d:1aff:fe88:3fa0 error=-14
[ 589.542320] IPv6: ADDRCONF(NETDEV_UP): bond1.260: link is not ready
[ 589.549555] IPv6: ADDRCONF(NETDEV_UP): bond1.270: link is not ready
[ 589.573421] IPv6: ADDRCONF(NETDEV_UP): bond1.271: link is not ready
[ 589.579052] IPv6: ADDRCONF(NETDEV_UP): bond1.272: link is not ready
[ 589.584746] IPv6: ADDRCONF(NETDEV_UP): bond1.273: link is not ready
[ 589.590109] IPv6: ADDRCONF(NETDEV_UP): bond1.280: link is not ready
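To check other nodes for the same messages on a live cluster, a sketch of the equivalent lookup (the dmesg output above came from the sosreport; the node name placeholder follows the redaction used earlier):

# Hypothetical live lookup for the bnxt_re GID failures and the bond1 VLAN
# ADDRCONF messages shown above.
$ oc debug node/worker-077.<redacted-clustername> -- chroot /host dmesg | grep -E 'add_roce_gid|ADDRCONF\(NETDEV_UP\): bond1\.'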
Looking at the sysctls showed IPv6 disabled on those VLAN sub-interfaces:
$ grep "disable_ipv6" sosreport-worker-077-2023-12-05-cluzbsy/sos_commands/kernel/sysctl_-a
net.ipv6.conf.all.disable_ipv6 = 0
net.ipv6.conf.bond1.disable_ipv6 = 0
net.ipv6.conf.bond1/241.disable_ipv6 = 0
net.ipv6.conf.bond1/260.disable_ipv6 = 1
net.ipv6.conf.bond1/270.disable_ipv6 = 1
net.ipv6.conf.bond1/271.disable_ipv6 = 1
net.ipv6.conf.bond1/272.disable_ipv6 = 1
net.ipv6.conf.bond1/273.disable_ipv6 = 1
net.ipv6.conf.bond1/280.disable_ipv6 = 1
However, other worker nodes that were reporting fine had these same sysctl values, so this can likely be disregarded.
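For completeness, this is roughly how that comparison could be repeated against a healthy worker on the live cluster (a sketch; worker-078 is only a stand-in for any node whose NNCE was not aborted):

# Hypothetical spot-check; a healthy node is expected to show the same
# disable_ipv6 values as the affected worker above.
$ oc debug node/worker-078.<redacted-clustername> -- chroot /host sysctl -a 2>/dev/null | grep disable_ipv6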
Looking at one of the nmstate-handler pods (the pod logs are attached to this Jira):
$ oc get pod -n openshift-nmstate nmstate-handler-xxgvk -o wide
NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE
nmstate-handler-xxgvk   1/1     Running   2          17d   204.151.100.90   worker-077.kub3-2.rch-mtce-1.vzwops.com
Ultimately, the fix for the 'ConfigurationAborted' state was to delete the nmstate-handler pod on each node whose NNCE reported 'ConfigurationAborted'. Once the pods were recreated, the STATUS corrected itself automatically.
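For reference, a sketch of that workaround as a loop (assumptions: the NNCE names follow the <node>.<policy> pattern shown above, and a node's handler pod can be matched by node name in the -o wide output; adjust before use):

# Hypothetical automation of the manual workaround described above: for every
# node whose enactment for this policy is Aborted, delete that node's
# nmstate-handler pod so it is recreated and reconciles again.
$ for node in $(oc get nodenetworkconfigurationenactments | awk '/ConfigurationAborted/ {sub(/\.wsnmacvlanpolicy-bond1$/, "", $1); print $1}'); do
    pod=$(oc get pod -n openshift-nmstate -o wide | awk -v n="$node" '/^nmstate-handler/ && $0 ~ n {print $1}')
    oc delete pod -n openshift-nmstate "$pod"
  done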
We're just trying to understand what prevented it from succeeding in the first place.
[1] https://issues.redhat.com/browse/OCPBUGS-11997
[2] https://issues.redhat.com/browse/OCPBUGS-11997?focusedId=22645417&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22645417
[3] https://bugzilla.redhat.com/show_bug.cgi?id=2000052
Version-Release number of selected component (if applicable):
OCP 4.12.29
How reproducible:
It happened in 3 different clusters after applying a machine config that sets the system hostname.
Steps to Reproduce:
For the customer, the trigger was applying this machine config [2] and rebooting; post-reboot, some nodes showed ConfigurationAborted. I'm not sure that's what caused it, though, since only a few nodes were affected: https://issues.redhat.com/browse/OCPBUGS-11997?focusedId=22645417&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22645417
Actual results:
Overall, the NNCP was Degraded because 24 of the 119 NNCEs had been marked 'Aborted'.
Expected results:
The NNCP is healthy, as are all of the NNCEs.
Additional info:
Links to: RHBA-2024:1052 OpenShift Container Platform 4.12.z bug fix update