  1. OpenShift Bugs
  2. OCPBUGS-77759

UDN DHCP outage on hosted controlplane / hypershift clusters

    • Critical
    • x86_64
    • Production
    • Customer Escalated

      Description of problem:

      Cluster config: ROKS 4.20 with OVN default CNI.

      Observation: CNV VMs attached to a UDN lose their IP address and DHCP service during a cluster master operation (tested with a patch version update).

      Version-Release number of selected component (if applicable):

      ROKS 4.20.12.

      CNV 4.20.3

      How reproducible: 100%

      Steps to Reproduce:

      0. ROKS cluster with master version 4.20.12.

      1. Create a namespace with primary UDN label. In the example, the namespace name is `green`

      2. Create a layer2 UDN with the following NAD:
      ```

      apiVersion: k8s.cni.cncf.io/v1
      kind: NetworkAttachmentDefinition
      metadata:
        name: green-net
        namespace: green
      spec:
        config: '{"allowPersistentIPs":true,"cniVersion":"1.0.0","joinSubnet":"100.65.0.0/16,fd99::/64","name":"cluster_udn_green-net","netAttachDefName":"green/green-net","role":"primary","subnets":"10.203.0.0/26","topology":"layer2","type":"ovn-k8s-cni-overlay"}'
      ```
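As a sanity check on the NAD above, the following minimal sketch parses the `config` string and confirms that the addresses reported in the VMI listing of this report fall inside the UDN subnet. The helper names `summarize` and `in_udn` are hypothetical and exist only for this illustration; the config string is copied from the NAD.

```python
import ipaddress
import json

# Config string copied verbatim from the NAD in this report.
NAD_CONFIG = (
    '{"allowPersistentIPs":true,"cniVersion":"1.0.0",'
    '"joinSubnet":"100.65.0.0/16,fd99::/64","name":"cluster_udn_green-net",'
    '"netAttachDefName":"green/green-net","role":"primary",'
    '"subnets":"10.203.0.0/26","topology":"layer2","type":"ovn-k8s-cni-overlay"}'
)


def summarize(config_json: str) -> dict:
    """Extract the fields relevant to this bug from the NAD config (hypothetical helper)."""
    cfg = json.loads(config_json)
    # "subnets" is a comma-separated string, not a JSON list.
    subnets = [ipaddress.ip_network(s.strip()) for s in cfg["subnets"].split(",")]
    return {
        "topology": cfg["topology"],
        "role": cfg["role"],
        "persistent_ips": cfg["allowPersistentIPs"],
        "subnets": subnets,
        "ipv4_only": all(net.version == 4 for net in subnets),
    }


def in_udn(ip: str, subnets) -> bool:
    """Return True if ip belongs to any of the UDN subnets (hypothetical helper)."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in subnets if net.version == addr.version)


summary = summarize(NAD_CONFIG)
print(summary["topology"], summary["role"], summary["subnets"][0].num_addresses)
for ip in ("10.203.0.12", "10.203.0.44", "10.203.0.47"):
    print(ip, in_udn(ip, summary["subnets"]))
```

The subnet is IPv4 only, so a VM that holds only an `fe80::` link-local address has no UDN-assigned address at all.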

      3. Install CNV 4.20.3 and create some CentOS VMs in the `green` namespace so that they are attached to the UDN. Verify that each VM gets its IP from the UDN over DHCP without any issue.

      ```

      $ k get vmi -A                                            
      NAMESPACE   NAME       AGE   PHASE     IP            NODENAME                                                READY
      green       example1   69m   Running   10.203.0.12   test-d6is4ks20q37r7i27big-gergo03022-default-00000244   True
      green       example2   69m   Running   10.203.0.13   test-d6is4ks20q37r7i27big-gergo03022-default-0000036a   True
      green       example3   61m   Running   10.203.0.44   test-d6is4ks20q37r7i27big-gergo03022-default-00000244   True
      green       example4   58m   Running   10.203.0.46   test-d6is4ks20q37r7i27big-gergo03022-default-0000036a   True
      green       example5   57m   Running   10.203.0.47   test-d6is4ks20q37r7i27big-gergo03022-default-0000036a   True

      ```

      4. Update the master to a new patch level (in this case the target level is 4.20.13). Wait until the DHCP lease expires in the CentOS VMs (probably 30 minutes).

      5. Check the attached IPs again.

      ```

      $ k get vmi -A                                               
      NAMESPACE   NAME       AGE    PHASE     IP                      NODENAME                                                READY
      green       example1   141m   Running                           test-d6is4ks20q37r7i27big-gergo03022-default-00000244   True
      green       example2   141m   Running   fe80::858:aff:fecb:d    test-d6is4ks20q37r7i27big-gergo03022-default-0000036a   True
      green       example3   133m   Running                           test-d6is4ks20q37r7i27big-gergo03022-default-00000244   True
      green       example4   129m   Running   fe80::858:aff:fecb:2e   test-d6is4ks20q37r7i27big-gergo03022-default-0000036a   True
      green       example5   129m   Running   fe80::858:aff:fecb:2f   test-d6is4ks20q37r7i27big-gergo03022-default-0000036a   True

      ```
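The check in step 5 can also be scripted instead of read off the table. The sketch below is a rough illustration that assumes the JSON form of the listing (`kubectl get vmi -A -o json`) and assumes that each entry under `status.interfaces` carries `ipAddress` and `ipAddresses`, as in the KubeVirt VMI status. The sample document is hand-written and trimmed to those fields; `example1` and `example2` mirror the rows above, while `healthy` is a hypothetical VMI added for contrast.

```python
import ipaddress
import json

# Hand-written, trimmed stand-in for `kubectl get vmi -A -o json` (assumption).
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"namespace": "green", "name": "example1"},
         "status": {"interfaces": [{"name": "default"}]}},
        {"metadata": {"namespace": "green", "name": "example2"},
         "status": {"interfaces": [{"name": "default",
                                    "ipAddress": "fe80::858:aff:fecb:d",
                                    "ipAddresses": ["fe80::858:aff:fecb:d"]}]}},
        # Hypothetical healthy VMI, not taken from the report.
        {"metadata": {"namespace": "green", "name": "healthy"},
         "status": {"interfaces": [{"name": "default",
                                    "ipAddress": "10.203.0.12",
                                    "ipAddresses": ["10.203.0.12"]}]}},
    ]
})


def is_ipv4(value: str) -> bool:
    """Return True only for a syntactically valid IPv4 address."""
    try:
        return ipaddress.ip_address(value).version == 4
    except ValueError:
        return False


def vmis_without_ipv4(doc: dict) -> list[str]:
    """List namespace/name of every VMI whose interfaces report no IPv4 address."""
    affected = []
    for item in doc.get("items", []):
        addrs = []
        for iface in item.get("status", {}).get("interfaces", []):
            addrs.extend(iface.get("ipAddresses") or [])
            if iface.get("ipAddress"):
                addrs.append(iface["ipAddress"])
        if not any(is_ipv4(a) for a in addrs):
            meta = item["metadata"]
            affected.append(f'{meta["namespace"]}/{meta["name"]}')
    return affected


affected = vmis_without_ipv4(json.loads(SAMPLE))
print(affected)
```

Against a live cluster, the same function could be fed the parsed output of the command instead of `SAMPLE`.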

      Actual results: VMs lose their DHCP-assigned IPv4 addresses; only link-local IPv6 addresses remain. VM logs show that the DHCP service is unavailable.
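The wait in step 4 follows from how DHCP clients time their leases: with the RFC 2131 defaults, a client first tries to renew at half the lease time (T1), falls back to rebinding at 87.5% (T2), and drops the address at expiry. The sketch below works through those timers for a 30-minute lease. The 30-minute value is the reporter's estimate, not a confirmed setting, and a server may override T1 and T2, so treat the numbers as illustrative.

```python
from datetime import timedelta


def dhcp_timers(lease: timedelta) -> dict[str, timedelta]:
    """Default DHCP client timers per RFC 2131: T1 = 0.5 * lease, T2 = 0.875 * lease."""
    return {
        "T1_renew": lease * 0.5,
        "T2_rebind": lease * 0.875,
        "expiry": lease,
    }


# Assumed lease length, based on the "probably 30 minutes" estimate in step 4.
timers = dhcp_timers(timedelta(minutes=30))
for name, value in timers.items():
    print(f"{name}: {value}")
```

Under these assumptions, a guest whose renewals go unanswered keeps its address for up to 30 minutes after its last successful lease, which would explain why the loss does not appear immediately after the update.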

      Expected results: DHCP is not disrupted.

      Additional info:

      I am not sure if this issue is specific to this master patch update. It might be that this is just one way to trigger it.

      Restarting the VM guest OS does not resolve the issue, and neither does restarting the DHCP client.

      Because the VMI is owned by a VM object, deleting the VMI resolves the issue: a new VMI comes up.

      Configuring the same IP statically on the guest, instead of using a DHCP client, also works. This suggests that the OVS datapath is generally available and that perhaps only DHCP is affected. Pods attached to the same UDN are also not disrupted, as they do not use DHCP.

      Affected Platforms:

      The issue is present and reproducible with any IBM Cloud managed OpenShift (ROKS) cluster with OVN as the default CNI.

              mduarted@redhat.com Miguel Duarte de Mora Barroso
              gergo.huszty@ibm.com Gergo Huszty
              Yoss Segev Yoss Segev