OpenShift Bugs / OCPBUGS-63650

OCP 4.18.3 - NNCP policy with captured names based on matched rules has errantly overwritten and injected transient network profiles created by configure-ovs.sh (breaking existing network setup)


    • Quality / Stability / Reliability
    • False
    • Critical

      Description of problem:

      4.18
      ovnkube cluster
      
      The customer has the following base configuration applied on their nodes (example taken from a single host).
      
      Static bond interfaces are defined on these nodes:
      
      ~~~
      ls /etc/NetworkManager/system-connections/
      ens1f0.nmconnection, ens1f1.nmconnection, bond1.nmconnection 
      
      bond1.nmconnection is the PRIMARY iface used for the node, mode4, LACP
      ## see first comment below for full network configuration/logs
      ~~~
      
      A set of NNCP policies was applied (full context in first private comment below) that includes the following config:
      
      
      ~~~     
          name: lldp-enabled-bond
        spec:
          capture:
            bonds: interfaces.type=="bond"
            bonds-lldp: capture.bonds-up | interfaces.lldp.enabled:=true
            bonds-up: capture.bonds | interfaces.state=="up"
          desiredState:
            interfaces: '{{ capture.bonds-lldp.interfaces }}'
      
      ...
      
          name: lldp-enabled-bond-members
        spec:
          capture:
            bonds: interfaces.type=="bond"
            bonds-lldp: capture.bonds-up | interfaces.lldp.enabled:=true
            bonds-up: capture.bonds | interfaces.state=="up"
          desiredState:
            interfaces:
            - lldp:
                enabled: true
              name: '{{ capture.bonds-lldp.interfaces.0.link-aggregation.port.0 }}'
              state: up
            - lldp:
                enabled: true
              name: '{{ capture.bonds-lldp.interfaces.0.link-aggregation.port.1 }}'
              state: up
      ~~~
      
      After this NNCP policy set is applied, the bond interface profile on the host is REMOVED and REPLACED with the following configuration files in /etc/NetworkManager/system-connections/:
      
      ~~~
      ens1f0-slave-ovs-clone.nmconnection, ens2f0-slave-ovs-clone.nmconnection, ovs-if-phys0.nmconnection
      ## see first comment for full detail + injected details.
      ## NOTE that ovs-if-phys0.nmconnection IS bond1, but with a LOT of additional garbage and invalid configuration values injected. The existing bond1 profile has been blended with the ovs-if-phys0 profile from br-ex configuration handling, and the merged result was written to a new file (supplanting the existing configuration).
      
      ovs-if-phys0.nmconnection
      [connection]
      id=ovs-if-phys0
      uuid=adc52518-24b6-4497-8f3e-6e51cf643597
      type=bond
      autoconnect-ports=1
      autoconnect-slaves=1
      controller=bond1
      interface-name=bond1 ##<---------
      lldp=1
      master=bond1
      port-type=ovs-port
      slave-type=ovs-port
      timestamp=1760596717
      ~~~
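The profile above names bond1 as both its `interface-name` and its own `controller`/`master`. A minimal sketch of a check for that pattern follows. The function name `self_referencing_controller` is hypothetical, and treating a self-referencing controller as invalid is an assumption of this sketch rather than a rule taken from NetworkManager documentation.

```shell
#!/bin/sh
# Hypothetical helper: print the interface name when the [connection] section
# of a NetworkManager keyfile lists that same name as its controller/master.
# Assumption: such a self-reference marks one of the invalid injected values.
self_referencing_controller() {
    awk -F= '
        /^\[/ { in_conn = ($0 == "[connection]"); next }
        in_conn && NF >= 2 {
            v = $2
            sub(/[ \t\r]+$/, "", v)            # trim trailing whitespace
            if ($1 == "interface-name") ifname = v
            if ($1 == "controller" || $1 == "master") ctrl = v
        }
        END { if (ifname != "" && ifname == ctrl) print ifname }
    ' "$1"
}
```

It could be run against a suspect file, for example `self_referencing_controller /etc/NetworkManager/system-connections/ovs-if-phys0.nmconnection`; empty output means the pattern was not found.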
      
      
      //IMPACT:
      Node networking is CURRENTLY UP because br-ex was not restarted. bond1 has been torn down as a persistent profile and exists only in runtime state, because NetworkManager remains loaded and the NNCP object did not bring down bond1; it brought this new phys interface down and up instead. As a result, br-ex is UP, but ONLY until the node is rebooted. If a node is rebooted, networking is lost until it is manually remediated with a network redeploy.
      
      Hundreds of nodes are in a very tenuous state; rebooting any of them will immediately degrade that host.
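Because the breakage only surfaces on reboot, a node can be inspected on disk beforehand. The sketch below is one possible pre-reboot check, not a supported tool. The function name `check_node_profiles` is hypothetical, and it assumes (per this report) that a healthy node keeps `bond1.nmconnection` in /etc/NetworkManager/system-connections and has no `ovs-if-phys*` or `*-slave-ovs-clone` profiles persisted there.

```shell
#!/bin/sh
# Hypothetical pre-reboot check. Returns non-zero when the on-disk profiles
# look like the broken state described in this report.
#   $1: profile directory (default /etc/NetworkManager/system-connections)
#   $2: expected bond name (default bond1)
check_node_profiles() {
    dir=${1:-/etc/NetworkManager/system-connections}
    bond=${2:-bond1}
    status=0
    if [ ! -e "$dir/$bond.nmconnection" ]; then
        echo "MISSING: $dir/$bond.nmconnection"
        status=1
    fi
    for f in "$dir"/ovs-if-phys*.nmconnection "$dir"/*-slave-ovs-clone.nmconnection; do
        [ -e "$f" ] || continue    # an unmatched glob stays literal; skip it
        echo "INJECTED: $f"
        status=1
    done
    return "$status"
}
```

A non-zero return suggests the node should be remediated before it is rebooted.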

      Version-Release number of selected component (if applicable):

      4.18.z

      How reproducible:

       Multiple times: 50 clusters were impacted without it being noticed, and all of them moved into this problem configuration state. (Remediation underway.)

      Steps to Reproduce:

          1. Create a bond interface.
          2. Push an NNCP policy that applies LLDP configuration based on match labels.
          3. Observe that the network state is overwritten on a given host and that a subsequent reboot breaks the network configuration.
          

      Actual results:

      Degraded cluster state and broken network config    

      Expected results:

      1. An LLDP configuration change based on a matching rule should NOT collect interfaces from /run/NetworkManager/system-connections, i.e. from the transient network configuration created by configure-ovs.sh. Only connections managed in /etc/NetworkManager/system-connections should be probed as valid target interfaces.
      
      2. It should not be possible to take transient network bridges and commit them, with the modification, to /etc/NetworkManager/system-connections. If this configuration selection logic is invalid or problematic, the source call should be denied, or the documentation should be highly explicit about this type of pattern match.
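The selection rule requested in item 1 can be illustrated with a small filter over `nmcli` output. This is only a sketch of the rule, not how nmstate or kubernetes-nmstate selects interfaces. The function name `persistent_connections` is hypothetical; it assumes input in the `NAME:FILENAME` shape produced by `nmcli -t -f NAME,FILENAME connection show`.

```shell
#!/bin/sh
# Hypothetical filter: read "NAME:FILENAME" lines on stdin and print only the
# names of profiles persisted under /etc/NetworkManager/system-connections.
# Profiles stored under /run (transient) are dropped.
# Note: nmcli terse mode escapes ":" inside values as "\:"; such names are
# printed with the escape left in place.
persistent_connections() {
    awk -F: '$NF ~ /^\/etc\/NetworkManager\/system-connections\// {
        name = $0
        sub(/:[^:]*$/, "", name)    # drop the trailing FILENAME field
        print name
    }'
}
```

On a node this could be used as `nmcli -t -f NAME,FILENAME connection show | persistent_connections` to list the candidates that would remain valid targets under that rule.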

       

      Additional info:

      See first comment below. 

              bnemec@redhat.com Benjamin Nemec
              rhn-support-wrussell Will Russell
              Ross Brattain Ross Brattain