CNV-51302: Upgrade SR-IOV lane to supported stable version k8s-1.29/30/31


      Why

      As Kubevirt supports the 3 latest minor k8s stable versions (currently 1.29, 1.30, 1.31),
      the sig-network SR-IOV lane is outdated and still runs k8s-1.28.
      Upgrading the provider to a newer stable version failed due to
      compatibility issues in the Kubevirt CI env with newer versions of kind, v0.20.0+
      [1] [2].
      A while ago there was an attempt to upgrade the SR-IOV provider to a newer kind,
      which also failed [3] [4].

       [1] https://github.com/kubevirt/kubevirtci/pull/1321
       [2] https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirtci/1321/check-up-kind-sriov/1856357396830490624
       [3] https://github.com/kubevirt/kubevirtci/pull/1122#issuecomment-1909502001
       [4] https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirtci/1122/check-up-kind-1.27-sriov/1750420778597224448


      The problem

      The failure manifests during cluster-up, when the kind cluster is created, in the
      join-worker-nodes phase [1].
      It seems that the worker node that is about to join is not reachable from the
      control-plane node at the network level.
      Creating a single-node cluster actually works; in fact, this is how the vGPU lane
      runs, against a kind cluster with a single node.
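
      A minimal way to reproduce and inspect the join failure locally (a sketch; kind and
      docker are assumed to be installed, and the file/cluster names below are illustrative):

          # kind-two-nodes.yaml -- two nodes, similar to the SR-IOV provider layout (illustrative)
          kind: Cluster
          apiVersion: kind.x-k8s.io/v1alpha4
          nodes:
            - role: control-plane
            - role: worker

          # --retain keeps the node containers around after a failed create,
          # so the kubeadm join logs can still be collected
          kind create cluster --name sriov-debug --config kind-two-nodes.yaml --retain

          # Collect kubelet/kubeadm/containerd logs from all node containers
          kind export logs ./kind-logs --name sriov-debug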


      Troubleshooting done so far:

      1. Reproduce the issue on local env - FAILED
        The issue won't reproduce on a local env using podman with cgroups-v2; cluster creation succeeds.
        The issue won't reproduce on a local env using docker with cgroups-v2; cluster creation succeeds.
      2. Run newer docker in CI - FAILED
        Running with a newer prow bootstrap image that includes newer Docker v20.10.23:
            `quay.io/kubevirtci/golang-legacy:v20230829-f2e4ded`
        Failed for the same reason (on the join-nodes phase, the target node is inaccessible) [2].
      3. Run using podman in CI - FAILED
        Failed for the same reason (on the join-nodes phase, the target node is inaccessible) [3].

      [1] https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirtci/1321/check-up-kind-sriov/1855965469882716160#1:build-log.txt%3A678
      [2] https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirtci/1321/check-up-kind-sriov/1856347258870566912
      [3] https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirtci/1321/check-up-kind-sriov/1856357396830490624
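
      For reference, the checks behind items 1 and 3 can be scripted roughly as follows
      (a sketch; the exact CI node setup may differ, and kind-two-nodes.yaml is the
      illustrative config from the sketch above):

          # Which cgroup version is the host running?
          stat -fc %T /sys/fs/cgroup/                    # cgroup2fs => cgroups-v2, tmpfs => cgroups-v1
          docker info --format '{{ .CgroupVersion }}'    # 1 or 2 (Docker 20.10+)

          # Retry cluster creation with the podman provider instead of docker (item 3)
          export KIND_EXPERIMENTAL_PROVIDER=podman
          kind create cluster --name sriov-debug --config kind-two-nodes.yaml --retain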


      Options for fixing KinD:

      1. cgroups-v2:
            Since the issue won't reproduce on an env that runs cgroups-v2, we suspect this
            is the root cause of the issue mentioned above.
            The kind v0.20.0 release notes [1] indicate some changes around cgroups
            compatibility:
              "In a future release kind node images will drop support for kind binaries
               without cgroupns=private (which is already the default on all cgroup v2
               hosts, and cgroup v1 in kind v0.20.0)."

            In addition, browsing KinD GitHub issues related to cgroups, the maintainers
            note that cgroups-v2 should be used.
            In light of the above, it seems kind v0.20.0+ doesn't play nice (or at all)
            with cgroups-v1.

            The Kubevirt CI env workloads cluster still runs cgroups-v1; it is not clear why.
            Since k8s and the surrounding ecosystem, including Kubevirt, are moving toward
            cgroups-v2, we should switch to cgroups-v2 on CI as well.

            AI:
            1.1. Reproduce the issue on a local env but using cgroups-v1 (similar to the CI nodes);
                 if the issue reproduces:
            1.2. Configure one CI node with SR-IOV HW with cgroups-v2 (see the sketch below).
            1.3. Run the kind upgrade PR [2] on this node.
            1.4. If the issue doesn't reproduce on the cgroups-v2 node, configure the other
                 SR-IOV nodes and we are done.
      2. Reach out to the Kind maintainers through a GitHub issue or Slack
          The issue should include details that require console access to the CI nodes [3]. Slack: #kind
      3. Troubleshoot latest Kind on podman in CI
      • CI runs the KinD providers on docker because they won't work using podman; the
          reason is not clear and requires additional investigation.
      • Newer versions of Podman switched to a different network backend.
          Try using a different network backend: pasta or slirp4netns.
      4. Troubleshoot latest Kind on newer docker in CI
        Stretch option; we had better invest in kind alternatives.

      [1] https://github.com/kubernetes-sigs/kind/releases/tag/v0.20.0 
      [2] https://github.com/kubevirt/kubevirtci/pull/1321
      [3] https://github.com/kubernetes-sigs/kind/blob/main/.github/ISSUE_TEMPLATE/bug-report.md
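
      If option 1 is pursued, switching a CI node to cgroups-v2 (AI 1.2) is typically a
      one-time boot-parameter change; a sketch, assuming RHEL/CentOS-style nodes managed
      with grubby (the actual CI node provisioning may differ):

          # Enable the unified cgroup hierarchy (cgroups-v2) on the next boot
          sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"
          sudo reboot

          # After the reboot, verify the switch took effect
          stat -fc %T /sys/fs/cgroup/    # should print cgroup2fs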


      Use KinD alternatives for the SR-IOV provider:

      1. SR-IOV device emulation:
          Conduct a PoC for using SR-IOV interface emulation in VM-based cluster nodes
          for Kubevirt SR-IOV tests. This was already introduced to RHEL [3] and has an epic on CNV [4].
      2. k3d:
          A while ago there was a successful PoC to transition the SR-IOV provider to
          k3d [1] [2] (see the sketch below).
          Conduct a PoC for using k3d for Kubevirt SR-IOV tests.

       [1] https://k3d.io/v5.7.4/
       [2] https://github.com/kubevirt/kubevirtci/pull/972
      [3] https://issues.redhat.com/browse/RHEL-1308
      [4] https://issues.redhat.com/browse/CNV-33590
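
      For the k3d PoC, a cluster with the same node layout as the current SR-IOV provider
      can be brought up with (a sketch, assuming k3d v5; cluster name and node counts are
      illustrative):

          # One server (control-plane) plus one agent (worker) node
          k3d cluster create sriov-poc --servers 1 --agents 1

          # Sanity check that both nodes joined
          kubectl get nodes -o wide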


      In case we won't find a Kind alternative that works:

      1. Test basic SR-IOV functionality:
          Create a Kind cluster with a single node, similar to the vGPU lane (see the
          sketch below).
          It means we drop the SR-IOV migration & cross-node connectivity tests in tier1
          and rely on D/S tier2 tests.
      2. Drop the SR-IOV tests entirely:
          Rely on D/S tier2 tests entirely.

      One big con of the above is that we lose feedback on SR-IOV related regressions.
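
      The single-node fallback (option 1) amounts to a config like the vGPU lane's;
      a sketch, assuming the same kind tooling (file and cluster names are illustrative):

          # kind-single-node.yaml -- single control-plane node, so no kubeadm join is needed
          kind: Cluster
          apiVersion: kind.x-k8s.io/v1alpha4
          nodes:
            - role: control-plane

          # Create the cluster from it
          kind create cluster --name sriov-single --config kind-single-node.yaml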
