- Bug
- Resolution: Unresolved
- 4.18.z
- Quality / Stability / Reliability
- Critical
Description of problem:
Version 4.18, ovnkube (OVN-Kubernetes) cluster.
The customer has the following base configuration applied on their nodes (example taken from a single host).
Static bond interfaces are defined on these nodes:
~~~
ls /etc/NetworkManager/system-connections/
ens1f0.nmconnection, ens1f1.nmconnection, bond1.nmconnection 
bond1.nmconnection is the PRIMARY iface used for the node, mode4, LACP
## see first comment below for full network configuration/logs
~~~
A set of NNCP policies was applied (full context in first private comment below) that includes the following config:
~~~     
    name: lldp-enabled-bond
  spec:
    capture:
      bonds: interfaces.type=="bond"
      bonds-lldp: capture.bonds-up | interfaces.lldp.enabled:=true
      bonds-up: capture.bonds | interfaces.state=="up"
    desiredState:
      interfaces: '{{ capture.bonds-lldp.interfaces }}'
...
    name: lldp-enabled-bond-members
  spec:
    capture:
      bonds: interfaces.type=="bond"
      bonds-lldp: capture.bonds-up | interfaces.lldp.enabled:=true
      bonds-up: capture.bonds | interfaces.state=="up"
    desiredState:
      interfaces:
      - lldp:
          enabled: true
        name: '{{ capture.bonds-lldp.interfaces.0.link-aggregation.port.0 }}'
        state: up
      - lldp:
          enabled: true
        name: '{{ capture.bonds-lldp.interfaces.0.link-aggregation.port.1 }}'
        state: up
~~~
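The capture expressions above select interfaces purely by `type` and `state`; nothing in them refers to where the backing NetworkManager profile is stored. The following is a minimal Python sketch of that selection logic, written only to illustrate the point. It is a simplified model, not nmstate's implementation, and the sample state in the usage note (interface names, bond mode) is hypothetical.

```python
import copy


def capture_bonds_lldp(interfaces):
    """Model of: bonds -> bonds-up -> bonds-lldp from the policy above.

    Assumption: each interface is a plain dict shaped like nmstate's
    reported state. This is an illustration, not nmstate code.
    """
    # capture.bonds: interfaces.type=="bond"
    bonds = [i for i in interfaces if i.get("type") == "bond"]
    # capture.bonds-up: capture.bonds | interfaces.state=="up"
    bonds_up = [i for i in bonds if i.get("state") == "up"]
    # capture.bonds-lldp: capture.bonds-up | interfaces.lldp.enabled:=true
    result = copy.deepcopy(bonds_up)  # leave the reported state untouched
    for iface in result:
        iface.setdefault("lldp", {})["enabled"] = True
    return result


def member_names(bonds_lldp):
    """Model of interfaces.0.link-aggregation.port.0 and .port.1."""
    ports = bonds_lldp[0]["link-aggregation"]["port"]
    return ports[0], ports[1]
```

With a hypothetical reported state containing `bond1` (up, ports `ens1f0` and `ens1f1`) and `br-ex` (an OVS bridge), `capture_bonds_lldp` returns only `bond1`, and `member_names` returns its first two ports. The selection depends only on what is currently reported as an up bond, regardless of which profile on the host is backing it.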
After this NNCP policy set is applied, the bond interface on the host is REMOVED and REPLACED with the following configuration files in /etc/NetworkManager/system-connections/:
~~~
ens1f0-slave-ovs-clone.nmconnection, ens2f0-slave-ovs-clone.nmconnection, ovs-if-phys0.nmconnection
## see first comment for full detail + injected details.
## NOTE: ovs-if-phys0.nmconnection IS bond1, but with many additional injected settings and invalid configuration values. The existing bond1 connection has been blended with the ovs-if-phys0 profile from br-ex configuration handling, and the merged result was written to a new file that supplants the existing configuration.
ovs-if-phys0.nmconnection
[connection]
id=ovs-if-phys0
uuid=adc52518-24b6-4497-8f3e-6e51cf643597
type=bond
autoconnect-ports=1
autoconnect-slaves=1
controller=bond1
interface-name=bond1 ##<---------
lldp=1
master=bond1
port-type=ovs-port
slave-type=ovs-port
timestamp=1760596717
~~~
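One thing that stands out in the profile above is that `controller` and `master` carry the same value as `interface-name`, so the connection appears to name itself as its own controller. The following Python sketch flags that pattern in a keyfile. Treating this as the invalid value is an assumption made for illustration; the function name is hypothetical and not part of any NetworkManager tooling.

```python
import configparser


def find_self_reference(keyfile_text):
    """Return the [connection] keys whose value equals interface-name.

    Assumption: a profile whose controller/master equals its own
    interface-name is suspicious. This check is illustrative only.
    """
    parser = configparser.ConfigParser(
        interpolation=None,   # keyfile values may contain '%'
        strict=False,         # tolerate duplicate keys or sections
        delimiters=("=",),    # keyfiles use '=' only
    )
    parser.read_string(keyfile_text)
    if not parser.has_section("connection"):
        return []
    conn = parser["connection"]
    iface = conn.get("interface-name", "").strip()
    if not iface:
        return []
    # Check both the current key and its deprecated alias.
    return [key for key in ("controller", "master")
            if conn.get(key, "").strip() == iface]
```

Called on the contents of a profile like the one above (without the inline `##<---` annotation), it returns `["controller", "master"]`; a profile without a self-reference returns an empty list.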
//IMPACT:
Node networking is CURRENTLY UP because br-ex was not restarted. bond1 has been torn down and exists only in runtime state, because NetworkManager remains loaded and the NNCP object did not bring bond1 down; it brought this new phys interface down and up instead. As a result, br-ex stays UP, but ONLY until the node is rebooted. IF a node is rebooted, networking is lost until it is manually remediated with a network redeploy.
Hundreds of nodes are in a very tenuous state: rebooting any of them will immediately degrade the host.
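To find nodes that are already in this state before they are rebooted, one option is to look for generated-looking profiles that have been persisted under /etc/NetworkManager/system-connections. The Python sketch below does that with file name patterns taken from the file names in this report. The patterns, the function name, and the assumption that such names should never appear in that directory are illustrative, not a confirmed rule.

```python
from fnmatch import fnmatch
from pathlib import Path

# Assumption: names derived from the files listed in this report.
SUSPECT_PATTERNS = (
    "ovs-if-phys*.nmconnection",
    "*-slave-ovs-clone.nmconnection",
)


def persisted_transient_profiles(
        directory="/etc/NetworkManager/system-connections"):
    """List profile files in `directory` that match the suspect patterns."""
    path = Path(directory)
    if not path.is_dir():
        return []
    return sorted(
        entry.name
        for entry in path.iterdir()
        if entry.is_file()
        and any(fnmatch(entry.name, pat) for pat in SUSPECT_PATTERNS)
    )
```

An empty list means no matching files were found in the directory; a non-empty list names the profiles that would need review before the node is rebooted.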
Version-Release number of selected component (if applicable):
4.18.z
How reproducible:
Multiple times. 50 clusters were impacted without it being noticed, and all of them ended up in this problem configuration state. (Remediation underway.)
Steps to Reproduce:
    1. Create a bond interface.
    2. Push an NNCP policy to apply LLDP configuration based on match labels.
    3. Observe that the network state is overwritten on a given host and that subsequent reboots break the networking config.
    
Actual results:
Degraded cluster state and broken network config
Expected results:
    1. An LLDP configuration change based on a matching rule should NOT collect interfaces from the transient network configuration that configure-ovs.sh creates under /run/NetworkManager/system-connections. Only connections managed in /etc/NetworkManager/system-connections should be probed as valid target interfaces.
    2. It should not be possible to take transient network bridges and commit them, with the modification, to /etc/NetworkManager/system-connections. If this configuration selection logic is invalid or problematic, we should deny this source call or ensure our documentation is highly explicit about this type of pattern match.
Additional info:
See first comment below.
- relates to: OCPBUGS-56392 br-ex interface missing after reboot of the worker node | 4.16 agent based installation (ASSIGNED)