OpenShift Bugs / OCPBUGS-24524

Post-reboot some nodes NodeNetworkConfigurationEnactments report ConfigurationAborted


    • Need to merge 4.14 first which is blocked because CI permafail
    • Release Note Not Required
    • In Progress

      Description of problem:

      The cluster was healthy, but we needed to fix a bug [1] with NetworkManager and hostname flapping, so the customer applied the machine config [2].

      After the nodes rebooted, we noticed a few of the NNCEs (24/119) showed ConfigurationAborted:

      $ oc get nodenetworkconfigurationenactments | awk '/ConfigurationAborted/ || /NAME/'
      NAME                                                       STATUS      REASON
      worker-076.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-077.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-079.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-081.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-082.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-083.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-084.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-085.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-086.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-087.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-088.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-089.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-090.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-091.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-092.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-093.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-094.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-097.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-100.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-101.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-102.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-103.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-104.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-112.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted

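      To see why a given enactment was aborted, the status of one of the affected NNCEs can be inspected directly, e.g. for worker-077 (the exact condition messages vary by kubernetes-nmstate version, so this is just the general approach):

      $ oc get nnce worker-077.<redacted-clustername>.wsnmacvlanpolicy-bond1 -o yaml
      # or just the conditions:
      $ oc get nnce worker-077.<redacted-clustername>.wsnmacvlanpolicy-bond1 -o jsonpath='{.status.conditions}'
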
      A sosreport from one of the workers above (worker-077) showed that all the interfaces created by nmstate did in fact exist:

      $ less sosreport-worker-077-2023-12-05-cluzbsy/sos_commands/networking/ip_-d_address 
      30: bond1.260@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
          link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535 
          vlan protocol 802.1Q id 260 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
      31: bond1.270@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
          link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535 
          vlan protocol 802.1Q id 270 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
      32: bond1.271@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
          link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535 
          vlan protocol 802.1Q id 271 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
      33: bond1.272@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
          link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535 
          vlan protocol 802.1Q id 272 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
      34: bond1.273@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
          link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535 
          vlan protocol 802.1Q id 273 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
      35: bond1.280@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
          link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535 
          vlan protocol 802.1Q id 280 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 

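      (For reference, the same check can be run live on a node with oc debug, along the lines of:)

      $ oc debug node/worker-077.<redacted-clustername> -- chroot /host ip -d link show type vlan
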
      Here are the interfaces that were expected to be configured:

      $ oc get nncp
      NAME                     STATUS     REASON
      wsnmacvlanpolicy-bond1   Degraded   FailedToConfigure

      $ oc get nncp wsnmacvlanpolicy-bond1 -o yaml
      apiVersion: nmstate.io/v1
      kind: NodeNetworkConfigurationPolicy
      metadata:
        name: wsnmacvlanpolicy-bond1
      spec:
        desiredState:
          interfaces:
          - name: bond1.271
            state: up
            type: vlan
            vlan:
              base-iface: bond1
              id: 271
          - name: bond1.272
            state: up
            type: vlan
            vlan:
              base-iface: bond1
              id: 272
          - name: bond1.270
            state: up
            type: vlan
            vlan:
              base-iface: bond1
              id: 270
          - name: bond1.260
            state: up
            type: vlan
            vlan:
              base-iface: bond1
              id: 260
          - name: bond1.280
            state: up
            type: vlan
            vlan:
              base-iface: bond1
              id: 280
          - name: bond1.273
            state: up
            type: vlan
            vlan:
              base-iface: bond1
              id: 273
      status:
        conditions:
        - lastHeartbeatTime: "2023-11-29T22:14:42Z"
          lastTransitionTime: "2023-11-29T22:14:34Z"
          reason: FailedToConfigure
          status: "False"
          type: Available
        - lastHeartbeatTime: "2023-11-29T22:14:42Z"
          lastTransitionTime: "2023-11-29T22:14:34Z"
          message: 0/119 nodes failed to configure, 24 nodes aborted configuration
          reason: FailedToConfigure
          status: "True"
          type: Degraded
        - lastHeartbeatTime: "2023-11-29T22:14:42Z"
          lastTransitionTime: "2023-11-29T22:14:34Z"
          reason: ConfigurationProgressing
          status: "False"
          type: Progressing
        lastUnavailableNodeCountUpdate: "2023-11-17T15:06:48Z" 

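      The aggregated counts come from the policy's Degraded condition; if needed, its message can be pulled out on its own with a jsonpath filter:

      $ oc get nncp wsnmacvlanpolicy-bond1 -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'
      0/119 nodes failed to configure, 24 nodes aborted configuration
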
      There was a thought that it might have been due to the IPv6 bug [3], as these messages were found in dmesg on the node around boot time:

      [  565.632488] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x20]=0x11 status 0x1
      [  565.632508] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
      [  565.632523] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
      [  565.632538] __ib_cache_gid_add: unable to add gid 0000:0000:0000:0000:0000:ffff:cc97:5e5a error=-14
      [  565.632582] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x21]=0x11 status 0x1
      [  565.632595] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
      [  565.632608] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
      [  565.632623] __ib_cache_gid_add: unable to add gid 0000:0000:0000:0000:0000:ffff:cc97:5e5a error=-14
      [  565.632712] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x22]=0x11 status 0x1
      [  565.632725] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
      [  565.632737] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
      [  565.632751] __ib_cache_gid_add: unable to add gid 2600:40d0:0000:000e:00cd:0fd0:0000:00be error=-14
      [  565.632806] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x23]=0x11 status 0x1
      [  565.632818] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
      [  565.632831] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
      [  565.632845] __ib_cache_gid_add: unable to add gid 2600:40d0:0000:000e:00cd:0fd0:0000:00be error=-14
      [  565.633758] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x24]=0x11 status 0x1
      [  565.633760] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
      [  565.633760] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
      [  565.633762] __ib_cache_gid_add: unable to add gid fe80:0000:0000:0000:e63d:1aff:fe88:3fa0 error=-14
      [  565.633798] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x25]=0x11 status 0x1
      [  565.633799] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
      [  565.633800] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
      [  565.633802] __ib_cache_gid_add: unable to add gid fe80:0000:0000:0000:e63d:1aff:fe88:3fa0 error=-14
      [  589.542320] IPv6: ADDRCONF(NETDEV_UP): bond1.260: link is not ready
      [  589.549555] IPv6: ADDRCONF(NETDEV_UP): bond1.270: link is not ready
      [  589.573421] IPv6: ADDRCONF(NETDEV_UP): bond1.271: link is not ready
      [  589.579052] IPv6: ADDRCONF(NETDEV_UP): bond1.272: link is not ready
      [  589.584746] IPv6: ADDRCONF(NETDEV_UP): bond1.273: link is not ready
      [  589.590109] IPv6: ADDRCONF(NETDEV_UP): bond1.280: link is not ready 

      Looking at the sysctls showed IPv6 disabled on the bond1 VLAN subinterfaces:

      $ grep "disable_ipv6" sosreport-worker-077-2023-12-05-cluzbsy/sos_commands/kernel/sysctl_-a
      net.ipv6.conf.all.disable_ipv6 = 0
      net.ipv6.conf.bond1.disable_ipv6 = 0
      net.ipv6.conf.bond1/241.disable_ipv6 = 0
      net.ipv6.conf.bond1/260.disable_ipv6 = 1
      net.ipv6.conf.bond1/270.disable_ipv6 = 1
      net.ipv6.conf.bond1/271.disable_ipv6 = 1
      net.ipv6.conf.bond1/272.disable_ipv6 = 1
      net.ipv6.conf.bond1/273.disable_ipv6 = 1
      net.ipv6.conf.bond1/280.disable_ipv6 = 1 

      However, other worker nodes that were reporting fine had these same sysctl values, so this can likely be disregarded.
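
      (For a live comparison between a healthy node and an affected one, the same sysctls can be read through oc debug; the node name here is a placeholder:)

      $ oc debug node/<node-name> -- chroot /host sysctl -a 2>/dev/null | grep disable_ipv6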

      Looking at one of the pods (the pod logs are attached to this Jira):

      $ oc get pod -n openshift-nmstate nmstate-handler-xxgvk -o wide 
      NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                                    
      nmstate-handler-xxgvk   1/1     Running   2          17d   204.151.100.90   worker-077.kub3-2.rch-mtce-1.vzwops.com  
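
      Given the RESTARTS count, the logs of both the current and the previous handler container can be collected with:

      $ oc logs -n openshift-nmstate nmstate-handler-xxgvk
      $ oc logs -n openshift-nmstate nmstate-handler-xxgvk --previous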

      Ultimately, the fix for the 'ConfigurationAborted' state was to delete the nmstate-handler pod on each node that reported it. After doing this, the STATUS corrected itself automatically.
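
      A minimal sketch of that workaround, assuming the NNCE names follow the <node>.<policy> pattern shown above and that the handler pod for a node can be matched via the NODE column of 'oc get pod -o wide':

      # find the nodes with aborted enactments and bounce the nmstate-handler pod on each;
      # the DaemonSet recreates the pod, which then re-runs the enactment
      for nnce in $(oc get nnce | awk '/ConfigurationAborted/ {print $1}'); do
          node="${nnce%.wsnmacvlanpolicy-bond1}"   # NNCE name is <node>.<policy>
          pod=$(oc get pod -n openshift-nmstate -o wide \
                | awk -v n="$node" '$1 ~ /^nmstate-handler/ && $0 ~ n {print $1}')
          [ -n "$pod" ] && oc delete pod -n openshift-nmstate "$pod"
      done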

      We're just trying to understand what prevented it from succeeding in the first place.

      [1] https://issues.redhat.com/browse/OCPBUGS-11997
      [2] https://issues.redhat.com/browse/OCPBUGS-11997?focusedId=22645417&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22645417
      [3] https://bugzilla.redhat.com/show_bug.cgi?id=2000052 

      Version-Release number of selected component (if applicable):

      OCP 4.12.29 

      How reproducible:

      It happened in 3 different clusters after applying a machine config that sets the system hostname.

      Steps to Reproduce:

      For the customer, the trigger was applying this machine config; post-reboot, some nodes showed ConfigurationAborted. I'm not sure the machine config itself is the cause, though, since only a few nodes were affected: https://issues.redhat.com/browse/OCPBUGS-11997?focusedId=22645417&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22645417

      Actual results:

      Overall, the NNCP was Degraded because 24/119 NNCEs had been marked 'Aborted'.

      Expected results:

      The NNCP should report Available, and all NNCEs should report successful configuration.

      Additional info:

          
