OpenShift Bugs / OCPBUGS-20353

NetworkManager fails on RHEL 8.6 worker configuring br-ex interface post reboot


      Description of problem:

      Upon node restart, NetworkManager fails to activate the br-ex interface. The bond has been configured using the NMState operator.

      Version-Release number of selected component (if applicable):

      OCP 4.12 with RHEL 8.6 worker nodes

      How reproducible:

      Upon node reboot

      Steps to Reproduce:

      1. Create a bond from ens1f0 + ens1f1 (bond0).
      2. Reboot the node.
      3. Observe the following on the affected node:
      
      [root@bl9261 ticams003]# ll /etc/sysconfig/network-scripts/*
      -rw-r--r--. 1 root root 347 May  8 13:54 /etc/sysconfig/network-scripts/ifcfg-bond0
      -rw-r--r--. 1 root root 453 Jun  2 16:03 /etc/sysconfig/network-scripts/ifcfg-bond0.3200
      -rw-r--r--. 1 root root 146 May  8 13:54 /etc/sysconfig/network-scripts/ifcfg-ens1f0
      -rw-r--r--. 1 root root 146 May  8 13:54 /etc/sysconfig/network-scripts/ifcfg-ens1f1 
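      The bond in step 1 was created via the NMState operator (kubernetes-nmstate). A minimal NodeNetworkConfigurationPolicy for this topology might look like the sketch below; the policy name and bond mode are assumptions, not taken from the customer's actual NNCP, and recent nmstate schema uses `port` where older releases used `slaves`:

```yaml
# Hypothetical NNCP sketch -- not the customer's actual policy.
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: bond0-policy            # assumed name
spec:
  desiredState:
    interfaces:
      - name: bond0
        type: bond
        state: up
        link-aggregation:
          mode: active-backup   # assumed mode
          port:
            - ens1f0
            - ens1f1
      - name: bond0.3200        # VLAN subinterface seen in the ifcfg listing
        type: vlan
        state: up
        vlan:
          base-iface: bond0
          id: 3200
```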

      Actual results:

      The following is visible in the boot logs:
      Aug 25 12:09:44 bl9263 NetworkManager[3388]: <info>  [1692965384.3619] device (br-ex): state change: deactivating -> unmanaged (reason 'removed', sys-iface-state: 'managed')
      Aug 25 12:09:44 bl9263 NetworkManager[3388]: <info>  [1692965384.3625] device (bond0.3200): state change: activated -> deactivating (reason 'unmanaged', sys-iface-state: 'managed')
      Aug 25 12:09:44 bl9263 NetworkManager[3388]: <info>  [1692965384.3648] device (bond0.3200): state change: deactivating -> unmanaged (reason 'removed', sys-iface-state: 'managed')
      Aug 25 12:09:44 bl9263 NetworkManager[3388]: <info>  [1692965384.3652] device (br-ex): state change: activated -> deactivating (reason 'unmanaged', sys-iface-state: 'managed')
      Aug 25 12:09:44 bl9263 NetworkManager[3388]: <info>  [1692965384.3663] device (br-ex): state change: deactivating -> unmanaged (reason 'removed', sys-iface-state: 'managed')
      Aug 25 12:09:44 bl9263 NetworkManager[3388]: <info>  [1692965384.3663] device (br-ex): detaching ovs interface br-ex
      Aug 25 12:09:44 bl9263 NetworkManager[3388]: <info>  [1692965384.3673] device (br-ex): state change: activated -> deactivating (reason 'unmanaged', sys-iface-state: 'managed')

      Expected results:

      Node networking comes up normally, with br-ex activated.

      Additional info:

      As a workaround, restarting NetworkManager brings the br-ex interface back and node networking recovers.
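      A minimal sketch of that workaround, assuming shell access to the affected node (`br-ex` is the OVS bridge named in the logs above; the verification commands are illustrative):

```shell
# Workaround sketch: restart NetworkManager so it re-activates br-ex.
sudo systemctl restart NetworkManager

# Verify the bridge is managed again and carries the node IP.
nmcli device status | grep br-ex
ip addr show br-ex
```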


            Tim Rozet added a comment -

            Dupe of https://issues.redhat.com/browse/OCPBUGS-18716

            Tim Rozet added a comment -

            Confirmed with customer this is a duplicate of: https://issues.redhat.com/browse/OCPBUGS-18716

             

            The bond is changing MAC addresses. The fix will be handled as part of OCPBUGS-18716.


            Alex Alecu (Inactive) added a comment - - edited

            Then why do I have three other clusters that function correctly with the same ifcfg scripts?

            Also, the bug problem statement is not correct: br-ex gets created every time; it is ovn-k8s-mp0 that does not get created from time to time, and that is what causes this issue.

            Mat Kowalski added a comment - - edited

            At least one reason for the race you are hitting is using ifcfg scripts and nmconnection files at the same time. This is not going to work. You need to move your ifcfg scripts to nmconnection files, as the latter is what kubernetes-nmstate uses.

            Once this is done, if you are still having problems we can have a deeper look.

            Alex Alecu (Inactive) added a comment - - edited

            mkowalsk@redhat.com, indeed our baremetal deployment uses ifcfg scripts. We use the same type of baremetal deployment in all four of our clusters; three of them do not have this issue.

            I understand we should transform everything into nmconnection files, but I'd like to tackle our issue first.
            What we noticed is that the OVN bridge and ports do not start when the issue manifests (seen in the NetworkManager logs).

            Networking is stuck halfway: br-ex was configured, bond0.3200 was added to the bridge, and the IP address was moved from the bond subinterface to the bridge. What we don't see is the ovn-k8s-mp0 device being created.

            We'll set up traces and post more info afterwards.


            Mat Kowalski added a comment -

            First of all, in order for us to provide any more information about this, we need TRACE logs from NetworkManager (https://access.redhat.com/solutions/7006538).

            Secondly, we can see that the nodes are mixing nmconnection files and ifcfg scripts: I can see `/etc/sysconfig/network-scripts/ifcfg-bond0.3200`, which was created manually (or via the customer's automation), as well as `/etc/NetworkManager/system-connections/bond0.3244.nmconnection`, which is the result of an applied NNCP. Using both methods at the same time is not supported; the solution here is to migrate all `ifcfg-*` scripts to nmconnection files. Having both ways of configuring interfaces at the same time is asking for trouble, and this should be fixed before we try any other approach.

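            One way to spot the mixed configuration described above is to scan for interface names that exist in both locations. A hypothetical helper sketch; the directory paths are the standard RHEL locations, and the fixture below only mirrors the layout seen in this bug:

```shell
# Hypothetical helper: list interface names configured BOTH as an ifcfg
# script and as a NetworkManager keyfile (nmconnection).
find_conflicts() {
  local ifcfg_dir=$1 keyfile_dir=$2 f name
  for f in "$ifcfg_dir"/ifcfg-*; do
    [ -e "$f" ] || continue
    name=${f##*/ifcfg-}
    if [ -e "$keyfile_dir/$name.nmconnection" ]; then
      echo "$name"
    fi
  done
}

# Demo on a throwaway fixture mirroring the listing from this bug.
# On a real node you would pass /etc/sysconfig/network-scripts and
# /etc/NetworkManager/system-connections instead.
tmp=$(mktemp -d)
mkdir -p "$tmp/network-scripts" "$tmp/system-connections"
touch "$tmp/network-scripts/ifcfg-bond0.3200"
touch "$tmp/network-scripts/ifcfg-ens1f0"
touch "$tmp/system-connections/bond0.3200.nmconnection"
find_conflicts "$tmp/network-scripts" "$tmp/system-connections"
rm -rf "$tmp"
```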

            Kenny Tordeurs added a comment -

            Is the customization they are performing supported?

            Kenny Tordeurs added a comment -

            Additional information shared by the customer that might be important: three clusters do not have this issue and one cluster does. It must be a configuration parameter issue.
