  1. OpenShift Bugs
  2. OCPBUGS-55325

Cannot apply a consistent NodeNetworkConfigurationPolicy for VRF-Lite


    • Quality / Stability / Reliability
    • Rejected

      VRF-Lite requires the cluster admin to attach an interface to a CUDN VRF. The most straightforward way to do so is with an NNCP such as:

      apiVersion: nmstate.io/v1
      kind: NodeNetworkConfigurationPolicy
      metadata:
        name: udn-test-vlan
      spec:
        desiredState:
          interfaces:
          - name: enp3s0
            state: up 
            controller: udn-test
      

      where "udn-test" is an existing VRF.

      However, the NNCP applies successfully only if the VRF already exists; if it does not, the NNCP fails and is not retried. This raises the concern of whether the approach is suitable for scenarios where convergence is expected, such as reboots, node scale-up, or restoring configuration backups.

      Looking at this in more detail: when a CUDN and its corresponding VRF are created, the VRF becomes managed by NM:

      [connection]
      id=udn-test
      uuid=8adba5af-0294-4f2e-8683-241214d49d6b
      type=vrf
      autoconnect=false
      interface-name=udn-test
      timestamp=1745516597
      
      [vrf]
      table=1008
      
      [ipv4]
      method=disabled
      
      [ipv6]
      addr-gen-mode=default
      method=ignore
      
      [proxy]
      
      [.nmmeta]
      nm-generated=true
      volatile=true
      external=true
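
The `[.nmmeta]` flags in the profile above are worth checking explicitly, since a profile marked `volatile=true` and `external=true` would be consistent with the VRF profile not surviving a reboot. The following is a minimal sketch, assuming the keyfile text has already been read from the node; the helper names `parse_keyfile` and `is_volatile_external` are hypothetical, and the embedded profile is a shortened copy of the one shown above.

```python
import configparser

# Shortened copy of the udn-test VRF profile from this report.
VRF_PROFILE = """\
[connection]
id=udn-test
type=vrf
autoconnect=false
interface-name=udn-test

[vrf]
table=1008

[.nmmeta]
nm-generated=true
volatile=true
external=true
"""


def parse_keyfile(text):
    """Parse NetworkManager keyfile text into a ConfigParser (hypothetical helper)."""
    # Only "=" separates keys from values; "%" interpolation is disabled.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(text)
    return parser


def is_volatile_external(profile):
    """Return True if [.nmmeta] flags the profile as both volatile and external."""
    if not profile.has_section(".nmmeta"):
        return False
    meta = profile[".nmmeta"]
    return meta.getboolean("volatile", fallback=False) and meta.getboolean(
        "external", fallback=False
    )


profile = parse_keyfile(VRF_PROFILE)
print(profile["connection"]["type"], profile["vrf"]["table"])
print("volatile and external:", is_volatile_external(profile))
```

A persistent profile without a `[.nmmeta]` section, such as the enp3s0 one, would return False from the same check.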
      

      Then, when the NNCP above is applied, the existing NM configuration for the interface is mutated to make it a port of that VRF:

      [jcaamano@sdn-08 vfr-lite]$ ssh core@192.168.111.24 sudo cat /etc/NetworkManager/system-connections/enp3s0.nmconnection
      [connection]
      id=enp3s0
      uuid=90d0354f-94c0-4189-9ef7-f932b4dbaf2e
      type=ethernet
      controller=udn-test
      interface-name=enp3s0
      port-type=vrf
      timestamp=1745516266
      
      [ethernet]
      
      [ipv4]
      dhcp-client-id=mac
      dhcp-timeout=2147483647
      method=auto
      
      [ipv6]
      addr-gen-mode=eui64
      address1=fe80::2a3:1cff:fe61:7d60/64
      dhcp-duid=ll
      dhcp-iaid=mac
      dhcp-timeout=2147483647
      method=auto
      ra-timeout=2147483647
      
      [proxy]
      

      All is fine up until this point.

      Now as the node reboots, this happens:

      • Neither the udn-test VRF nor its profile exists.
      • The enp3s0 profile remains as is, configuring the interface as a port of the udn-test VRF. However, enp3s0 is not actually attached to the VRF, because the VRF doesn't exist yet.
      • Eventually ovnk runs and creates the udn-test VRF, and enp3s0 is attached to it.
      • There is no apparent transition in the NNCP state.
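
To confirm on a node whether enp3s0 actually ended up attached to the VRF after the sequence above, one option is to read the `master` symlink that the kernel exposes under `/sys/class/net/<interface>/`. This is a minimal sketch, not something taken from this report; the function name `controller_of` and the `sysfs_root` parameter are hypothetical and exist only so the lookup can be exercised against a different directory.

```python
import os


def controller_of(ifname, sysfs_root="/sys/class/net"):
    """Return the controller (master) device name of ifname, or None if it has none."""
    master_link = os.path.join(sysfs_root, ifname, "master")
    # The kernel only creates this symlink while the interface is enslaved.
    if not os.path.islink(master_link):
        return None
    # The link target is a path to the controller device; its last
    # component is the device name.
    return os.path.basename(os.readlink(master_link))


# On the node from this report, this is expected to be None before ovnk
# creates the VRF and the VRF name once the interface has been attached.
print(controller_of("enp3s0"))
```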

      So even though we expected potential problems on reboot, this actually works fine.

      However, we can expect problems on node scale-up (and in other similar scenarios), since there is a chance that the NNCP is applied before ovnk has had the chance to create the VRF. In that case the NNCP will fail, remain in a failed state, and never apply the NM configuration changes needed to set the interface as a port of the VRF on that node.
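
The ordering problem described above can be sketched as a wait-then-apply loop: apply the policy only once the VRF is observed, and give up after a bounded number of checks. This is an illustration of the retry idea only, not how knmstate behaves today; `apply_when_vrf_exists`, `vrf_exists` and `apply_policy` are hypothetical names, and the two callables stand in for whatever would actually check for the VRF and apply the NNCP.

```python
import time


def apply_when_vrf_exists(vrf_exists, apply_policy, attempts=5, delay=1.0,
                          sleep=time.sleep):
    """Call apply_policy() as soon as vrf_exists() returns True.

    Checks up to `attempts` times, doubling the wait between checks.
    Returns True if the policy was applied, False if the VRF never appeared.
    """
    for attempt in range(attempts):
        if vrf_exists():
            apply_policy()
            return True
        # No need to wait after the final check.
        if attempt < attempts - 1:
            sleep(delay * (2 ** attempt))
    return False


# Simulated scale-up: the VRF only shows up on the third check.
checks = iter([False, False, True])
applied = []
ok = apply_when_vrf_exists(
    vrf_exists=lambda: next(checks),
    apply_policy=lambda: applied.append("udn-test-vlan"),
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(ok, applied)
```

A bounded loop like this still ends in a failed state if the VRF never appears, so it narrows the race rather than removing it.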

      An alternative is:

      • Create the VRF from the NNCP as well. This requires changes in ovnk, either to use predictable table IDs for the VRFs or to fully give up ownership of the VRF and expect something else to create it.

      We need to understand:

      • Whether asking knmstate to retry the NNCP is the most reasonable way forward.
      • Whether we should instead change ovnk to give up ownership of the VRF in specific scenarios.
      • Whether there are configuration alternatives in knmstate that would work better for us.
      • Whether something else in knmstate makes this work better than we expect. For example, knmstate might wait for the node to be ready before applying configuration changes; if ovnk creates VRFs as part of its initial sync, that would give us a happens-before relationship between the two events.

              jcaamano@redhat.com Jaime Caamaño Ruiz
              Ying Wang