OpenShift Bugs / OCPBUGS-19020

Unstable node internal IP causes connection errors for KubeVirt platform


    • Critical
    • No
    • False

      Description of problem:

      The HyperShift KubeVirt (OpenShift Virtualization) platform has worker nodes that are hosted by KubeVirt virtual machines. Each worker node's internal IP address is derived by inspecting the vmi.status.interfaces field of the corresponding KubeVirt VMI.
      
      Because the vmi.status.interfaces field sources its information from the QEMU guest agent, it is not guaranteed to remain static in some scenarios, such as a soft reboot or a period when the QEMU guest agent is temporarily unavailable. In these situations, the interfaces list is empty.
      
      When the interfaces list on the VMI is empty, the HyperShift-related components (cloud-provider-kubevirt and cluster-api-provider-kubevirt) strip the worker node's internal IP. Stripping the node's internal IP causes unpredictable behavior that results in connectivity failures from the KAS to the worker node kubelets.
      
      To address this, the HyperShift-related KubeVirt components need to update the internal IP of a worker node only when the vmi.status.interfaces list has an IP for the default interface. Otherwise, these components should keep the last known internal IP address rather than stripping the internal IP address from the node.
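
The proposed behavior can be illustrated with a minimal sketch. This is not the code of cloud-provider-kubevirt or cluster-api-provider-kubevirt; the function name select_internal_ip, the dict-shaped interface entries with "name" and "ipAddress" keys, and the interface name "default" are all assumptions made for illustration.

```python
from typing import Mapping, Optional, Sequence


def select_internal_ip(
    interfaces: Optional[Sequence[Mapping[str, str]]],
    last_known_ip: Optional[str],
    default_name: str = "default",  # assumed name of the default interface
) -> Optional[str]:
    """Pick the IP to publish as the node's internal IP.

    Hypothetical helper: returns the default interface's IP when the
    interface list reports one, otherwise falls back to the last known
    IP instead of returning nothing.
    """
    for iface in interfaces or []:
        # Only trust an entry that is the default interface AND has an IP.
        if iface.get("name") == default_name and iface.get("ipAddress"):
            return iface["ipAddress"]
    # Interface list is empty, missing, or has no usable IP:
    # keep the previous address rather than stripping it.
    return last_known_ip
```

With this shape, an empty interfaces list during a soft reboot leaves the previously reported address in place, and the address changes only once the list reports a new IP for the default interface.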

      Version-Release number of selected component (if applicable):

      4.14

      How reproducible:

      100% given enough time and the right environment.

      Steps to Reproduce:

      1. Create a HyperShift KubeVirt guest cluster.
      2. Run the CSI conformance test suite in a loop (this test suite briefly causes the vmi.status.interfaces list to become unstable at times).
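
While the loop in step 2 is running, it may help to confirm that the interfaces list is actually dropping out. The sketch below counts such dropouts across a series of VMI objects captured over time, for example parsed from repeated JSON dumps of the VMI. The helper names, the "default" interface name, and the "name" / "ipAddress" keys are assumptions for illustration, and how the snapshots are collected is left to the reader.

```python
from typing import Iterable, Mapping


def has_default_ip(vmi: Mapping, default_name: str = "default") -> bool:
    """True if the VMI snapshot reports an IP for the default interface."""
    interfaces = (vmi.get("status") or {}).get("interfaces") or []
    return any(
        iface.get("name") == default_name and iface.get("ipAddress")
        for iface in interfaces
    )


def count_ip_dropouts(snapshots: Iterable[Mapping], default_name: str = "default") -> int:
    """Count transitions from 'IP present' to 'IP absent' across snapshots."""
    dropouts = 0
    previously_had_ip = False
    for vmi in snapshots:
        has_ip = has_default_ip(vmi, default_name)
        if previously_had_ip and not has_ip:
            dropouts += 1  # the IP was reported before and is now missing
        previously_had_ip = has_ip
    return dropouts
```

A non-zero count over the run suggests the instability described in this report occurred at least once during the observation window.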
      

      Actual results:

      The CSI test suite eventually begins failing because it can no longer exec into pods on the worker nodes. This is caused by the node's internal IP being removed.

      Expected results:

      The CSI conformance test suite should pass reliably.

      Additional info:

       

            rhn-engineering-dvossel David Vossel
            rhn-engineering-dvossel David Vossel
            Yu Li (李宇) Yu Li (李宇)
