OpenShift Bugs / OCPBUGS-24524

Post-reboot some nodes NodeNetworkConfigurationEnactments report ConfigurationAborted


    • Need to merge 4.14 first which is blocked because CI permafail
    • Release Note Not Required
    • In Progress

      Description of problem:

      The cluster was healthy, but we needed to fix a bug [1] with NetworkManager and hostname flapping, so the customer applied the machine config [2].

      After the nodes rebooted, we noticed a few of the NNCEs (24/119) showed ConfigurationAborted:

      $ oc get nodenetworkconfigurationenactments | awk '/ConfigurationAborted/ || /NAME/'
      NAME                                                       STATUS      REASON
      worker-076.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-077.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-079.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-081.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-082.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-083.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-084.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-085.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-086.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-087.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-088.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-089.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-090.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-091.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-092.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-093.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-094.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-097.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-100.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-101.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-102.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-103.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-104.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted
      worker-112.<redacted-clustername>.wsnmacvlanpolicy-bond1   Aborted     ConfigurationAborted

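      To see why a given enactment was aborted, the status of one of the affected NNCEs can be inspected directly, e.g. for worker-077 (the exact condition messages vary by kubernetes-nmstate version, so this is just the general approach):

      $ oc get nnce worker-077.<redacted-clustername>.wsnmacvlanpolicy-bond1 -o yaml
      # or just the conditions:
      $ oc get nnce worker-077.<redacted-clustername>.wsnmacvlanpolicy-bond1 -o jsonpath='{.status.conditions}'
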
      A sosreport from one of the workers above (worker-077) showed that all the interfaces created by nmstate did in fact exist:

      $ less sosreport-worker-077-2023-12-05-cluzbsy/sos_commands/networking/ip_-d_address 
      30: bond1.260@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
          link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535 
          vlan protocol 802.1Q id 260 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
      31: bond1.270@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
          link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535 
          vlan protocol 802.1Q id 270 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
      32: bond1.271@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
          link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535 
          vlan protocol 802.1Q id 271 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
      33: bond1.272@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
          link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535 
          vlan protocol 802.1Q id 272 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
      34: bond1.273@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
          link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535 
          vlan protocol 802.1Q id 273 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
      35: bond1.280@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
          link/ether b8:ce:f6:6f:f5:e8 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 0 maxmtu 65535 
          vlan protocol 802.1Q id 280 <REORDER_HDR> numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 

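      (For reference, the same check can be run live on a node with oc debug, along the lines of:)

      $ oc debug node/worker-077.<redacted-clustername> -- chroot /host ip -d link show type vlan
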
      Here are the interfaces that were expected to be configured:

      $ oc get nncp
      NAME                     STATUS     REASON
      wsnmacvlanpolicy-bond1   Degraded   FailedToConfigure

      $ oc get nncp wsnmacvlanpolicy-bond1 -o yaml
      apiVersion: nmstate.io/v1
      kind: NodeNetworkConfigurationPolicy
      metadata:
        name: wsnmacvlanpolicy-bond1
      spec:
        desiredState:
          interfaces:
          - name: bond1.271
            state: up
            type: vlan
            vlan:
              base-iface: bond1
              id: 271
          - name: bond1.272
            state: up
            type: vlan
            vlan:
              base-iface: bond1
              id: 272
          - name: bond1.270
            state: up
            type: vlan
            vlan:
              base-iface: bond1
              id: 270
          - name: bond1.260
            state: up
            type: vlan
            vlan:
              base-iface: bond1
              id: 260
          - name: bond1.280
            state: up
            type: vlan
            vlan:
              base-iface: bond1
              id: 280
          - name: bond1.273
            state: up
            type: vlan
            vlan:
              base-iface: bond1
              id: 273
      status:
        conditions:
        - lastHeartbeatTime: "2023-11-29T22:14:42Z"
          lastTransitionTime: "2023-11-29T22:14:34Z"
          reason: FailedToConfigure
          status: "False"
          type: Available
        - lastHeartbeatTime: "2023-11-29T22:14:42Z"
          lastTransitionTime: "2023-11-29T22:14:34Z"
          message: 0/119 nodes failed to configure, 24 nodes aborted configuration
          reason: FailedToConfigure
          status: "True"
          type: Degraded
        - lastHeartbeatTime: "2023-11-29T22:14:42Z"
          lastTransitionTime: "2023-11-29T22:14:34Z"
          reason: ConfigurationProgressing
          status: "False"
          type: Progressing
        lastUnavailableNodeCountUpdate: "2023-11-17T15:06:48Z" 

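      The aggregated counts come from the policy's Degraded condition; if needed, its message can be pulled out on its own with a jsonpath filter:

      $ oc get nncp wsnmacvlanpolicy-bond1 -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'
      0/119 nodes failed to configure, 24 nodes aborted configuration
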
      There was a thought that it might have been due to the IPv6 bug [3], as these messages were found in dmesg on the node around boot time:

      [  565.632488] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x20]=0x11 status 0x1
      [  565.632508] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
      [  565.632523] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
      [  565.632538] __ib_cache_gid_add: unable to add gid 0000:0000:0000:0000:0000:ffff:cc97:5e5a error=-14
      [  565.632582] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x21]=0x11 status 0x1
      [  565.632595] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
      [  565.632608] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
      [  565.632623] __ib_cache_gid_add: unable to add gid 0000:0000:0000:0000:0000:ffff:cc97:5e5a error=-14
      [  565.632712] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x22]=0x11 status 0x1
      [  565.632725] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
      [  565.632737] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
      [  565.632751] __ib_cache_gid_add: unable to add gid 2600:40d0:0000:000e:00cd:0fd0:0000:00be error=-14
      [  565.632806] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x23]=0x11 status 0x1
      [  565.632818] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
      [  565.632831] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
      [  565.632845] __ib_cache_gid_add: unable to add gid 2600:40d0:0000:000e:00cd:0fd0:0000:00be error=-14
      [  565.633758] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x24]=0x11 status 0x1
      [  565.633760] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
      [  565.633760] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
      [  565.633762] __ib_cache_gid_add: unable to add gid fe80:0000:0000:0000:e63d:1aff:fe88:3fa0 error=-14
      [  565.633798] bnxt_en 0000:19:00.1: QPLIB: cmdq[0x25]=0x11 status 0x1
      [  565.633799] bnxt_en 0000:19:00.1 bnxt_re1: Failed to add GID: 0xfffffff2
      [  565.633800] infiniband bnxt_re1: add_roce_gid GID add failed port=1 index=2
      [  565.633802] __ib_cache_gid_add: unable to add gid fe80:0000:0000:0000:e63d:1aff:fe88:3fa0 error=-14
      [  589.542320] IPv6: ADDRCONF(NETDEV_UP): bond1.260: link is not ready
      [  589.549555] IPv6: ADDRCONF(NETDEV_UP): bond1.270: link is not ready
      [  589.573421] IPv6: ADDRCONF(NETDEV_UP): bond1.271: link is not ready
      [  589.579052] IPv6: ADDRCONF(NETDEV_UP): bond1.272: link is not ready
      [  589.584746] IPv6: ADDRCONF(NETDEV_UP): bond1.273: link is not ready
      [  589.590109] IPv6: ADDRCONF(NETDEV_UP): bond1.280: link is not ready 

      Looking at the sysctls showed IPv6 disabled on the bond1 VLAN subinterfaces:

      $ grep "disable_ipv6" sosreport-worker-077-2023-12-05-cluzbsy/sos_commands/kernel/sysctl_-a
      net.ipv6.conf.all.disable_ipv6 = 0
      net.ipv6.conf.bond1.disable_ipv6 = 0
      net.ipv6.conf.bond1/241.disable_ipv6 = 0
      net.ipv6.conf.bond1/260.disable_ipv6 = 1
      net.ipv6.conf.bond1/270.disable_ipv6 = 1
      net.ipv6.conf.bond1/271.disable_ipv6 = 1
      net.ipv6.conf.bond1/272.disable_ipv6 = 1
      net.ipv6.conf.bond1/273.disable_ipv6 = 1
      net.ipv6.conf.bond1/280.disable_ipv6 = 1 

      However, other worker nodes that were reporting fine had these same sysctl values, so this can likely be disregarded.
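
      (For a live comparison between a healthy node and an affected one, the same sysctls can be read through oc debug; the node name here is a placeholder:)

      $ oc debug node/<node-name> -- chroot /host sysctl -a 2>/dev/null | grep disable_ipv6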

      Looking at one of the pods (the pod logs are attached to this Jira):

      $ oc get pod -n openshift-nmstate nmstate-handler-xxgvk -o wide 
      NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                                    
      nmstate-handler-xxgvk   1/1     Running   2          17d   204.151.100.90   worker-077.kub3-2.rch-mtce-1.vzwops.com  
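
      Given the RESTARTS count, the logs of both the current and the previous handler container can be collected with:

      $ oc logs -n openshift-nmstate nmstate-handler-xxgvk
      $ oc logs -n openshift-nmstate nmstate-handler-xxgvk --previous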

      Ultimately, the fix for the 'ConfigurationAborted' state was to delete the nmstate-handler pod on each node that reported it. After doing this, the STATUS corrected itself automatically.
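
      A minimal sketch of that workaround, assuming the NNCE names follow the <node>.<policy> pattern shown above and that the handler pod for a node can be matched via the NODE column of 'oc get pod -o wide':

      # find the nodes with aborted enactments and bounce the nmstate-handler pod on each;
      # the DaemonSet recreates the pod, which then re-runs the enactment
      for nnce in $(oc get nnce | awk '/ConfigurationAborted/ {print $1}'); do
          node="${nnce%.wsnmacvlanpolicy-bond1}"   # NNCE name is <node>.<policy>
          pod=$(oc get pod -n openshift-nmstate -o wide \
                | awk -v n="$node" '$1 ~ /^nmstate-handler/ && $0 ~ n {print $1}')
          [ -n "$pod" ] && oc delete pod -n openshift-nmstate "$pod"
      done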

      We're just trying to understand what prevented it from succeeding in the first place.

      [1] https://issues.redhat.com/browse/OCPBUGS-11997
      [2] https://issues.redhat.com/browse/OCPBUGS-11997?focusedId=22645417&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22645417
      [3] https://bugzilla.redhat.com/show_bug.cgi?id=2000052 

      Version-Release number of selected component (if applicable):

      OCP 4.12.29 

      How reproducible:

      It happened in 3 different clusters after applying a machine config that sets the system hostname.

      Steps to Reproduce:

      For the customer, the trigger was applying this machine config; post-reboot, some nodes showed ConfigurationAborted. I'm not sure the machine config itself is the cause, though, since only a few nodes were affected: https://issues.redhat.com/browse/OCPBUGS-11997?focusedId=22645417&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22645417

      Actual results:

      Overall, the NNCP was Degraded because 24/119 NNCEs had been marked 'Aborted'.

      Expected results:

      The NNCP should report Available, and all NNCEs should report successful configuration.

      Additional info:

          
