Type: Story
Resolution: Done
Why
Kubevirt supports the 3 latest stable Kubernetes minor versions (currently
1.29, 1.30, 1.31), but the sig-network SR-IOV lane is outdated and still runs
k8s-1.28.
Upgrading the provider to a newer stable version failed due to compatibility
issues in the Kubevirt CI environment with newer kind versions (v0.20.0+)
[1] [2].
A while ago there was an attempt to upgrade the SR-IOV provider to a newer
kind, which also failed [3] [4].
[1] https://github.com/kubevirt/kubevirtci/pull/1321
[2] https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirtci/1321/check-up-kind-sriov/1856357396830490624
[3] https://github.com/kubevirt/kubevirtci/pull/1122#issuecomment-1909502001
[4] https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirtci/1122/check-up-kind-1.27-sriov/1750420778597224448
The problem
The failure manifests during cluster-up, when the kind cluster is created, in
the join-worker-nodes phase [1].
It seems the worker node that is about to join is not reachable from the
control-plane node at the network level.
Creating a single-node cluster actually works; in fact this is how the VGPU
lane runs, against a kind cluster with a single node (a minimal sketch of both
topologies is below).
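A minimal sketch of the two topologies, assuming a stock kind binary; the cluster names and the config file name are hypothetical. The multi-node variant is the one that fails at the join phase in CI:

    # sriov-multi.yaml (hypothetical) - 1 control-plane + 2 workers
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
    - role: control-plane
    - role: worker
    - role: worker

    # Single-node cluster (VGPU-lane style) - creation succeeds
    kind create cluster --name sriov-single
    # Multi-node cluster - fails at the "Joining worker nodes" step in CI
    kind create cluster --name sriov-multi --config sriov-multi.yaml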
Troubleshooting done so far:
- Reproduce the issue on local env - FAILED
The issue won't reproduce on a local env using podman with cgroups-v2; cluster creation succeeds.
The issue won't reproduce on a local env using docker with cgroups-v2; cluster creation succeeds.
- Run newer docker in CI - FAILED
Running with a newer prow bootstrap image that includes newer Docker v20.10.23:
`quay.io/kubevirtci/golang-legacy:v20230829-f2e4ded`
Failed for the same reason (at the join-nodes phase, the target node is inaccessible) [2].
- Run using podman in CI - FAILED
Failed for the same reason (at the join-nodes phase, the target node is inaccessible) [3].
[1] https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirtci/1321/check-up-kind-sriov/1855965469882716160#1:build-log.txt%3A678
[2] https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirtci/1321/check-up-kind-sriov/1856347258870566912
[3] https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirtci/1321/check-up-kind-sriov/1856357396830490624
Options for fixing KinD:
- cgroups-v2:
Since the issue won't reproduce on an env that runs cgroups-v2, we suspect this
is the root cause of the issue mentioned above.
The kind v0.20.0 release notes [1] indicate some changes around cgroups
compatibility:
"In a future release kind node images will drop support for kind binaries
without cgroupns=private (which is already the default on all cgroup v2
hosts, and cgroup v1 in kind v0.20.0)."
In addition, browsing kind GitHub issues related to cgroups, the maintainers
state that cgroups-v2 should be used.
In light of the above, it seems kind v0.20.0+ doesn't play nice (or at all)
with cgroups-v1.
The Kubevirt CI workloads cluster still runs cgroups-v1; it is not clear why.
Since K8s and the ecosystem around it, including Kubevirt, are moving toward
cgroups-v2, we should switch to cgroups-v2 on CI as well (a minimal check
sketch follows the action items below).
AI:
1.1. Reproduce the issue on a local env but using cgroups-v1 (similar to the CI nodes);
if the issue reproduces:
1.2. Configure one CI node with SR-IOV HW with cgroups-v2.
1.3. Run the kind upgrade PR [2] on this node.
1.4. If the issue doesn't reproduce on the cgroups-v2 node, configure the other
SR-IOV nodes and we are done.
- Reach the kind maintainers through a GitHub issue or Slack
The issue should include some details that require console access to the CI nodes [3]. Slack: #kind
- Troubleshoot latest kind on podman in CI
CI runs the KinD providers on docker because they won't work using podman; the
reason is not clear and requires additional investigation.
Newer versions of Podman switched to a different network backend; try using a
different network backend (pasta, slirp4netns) - see the sketch below.
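A minimal sketch of such a run, assuming rootless podman and that the rootless network backend is selectable via containers.conf (option name per the containers.conf documentation; verify against the installed Podman version):

    # Run kind with the experimental podman provider
    export KIND_EXPERIMENTAL_PROVIDER=podman
    kind create cluster --name sriov-multi --config sriov-multi.yaml  # hypothetical config from the sketch above

    # ~/.config/containers/containers.conf - switch the rootless network backend
    # [network]
    # default_rootless_network_cmd = "pasta"   # or "slirp4netns"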
- Troubleshoot latest kind on newer docker in CI
A stretch option; we had better invest in kind alternatives.
[1] https://github.com/kubernetes-sigs/kind/releases/tag/v0.20.0
[2] https://github.com/kubevirt/kubevirtci/pull/1321
[3] https://github.com/kubernetes-sigs/kind/blob/main/.github/ISSUE_TEMPLATE/bug-report.md
Use KinD alternatives for the SR-IOV provider:
- SR-IOV device emulation:
Conduct a PoC for using SR-IOV interface emulation in VM-based cluster nodes
for Kubevirt SR-IOV tests. This was already introduced to RHEL [3] and has an
epic on CNV [4] (see the sketch after this list).
- k3d:
A while ago there was a successful PoC to transition the SR-IOV provider to
k3d [1] [2].
Conduct a PoC for using k3d for Kubevirt SR-IOV tests (a minimal sketch
follows the references below).
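A minimal sketch of the emulation idea, assuming the VM-based node exposes an SR-IOV-capable emulated NIC (for example QEMU's igb model); the interface name eth1 is hypothetical and VF management uses the standard sysfs interface:

    # Inside the VM-based cluster node: check and enable virtual functions on the PF
    cat /sys/class/net/eth1/device/sriov_totalvfs
    echo 2 > /sys/class/net/eth1/device/sriov_numvfs

    # The VFs should show up as additional PCI devices / netdevs
    lspci | grep -i "virtual function"
    ip link show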
[1] https://k3d.io/v5.7.4/
[2] https://github.com/kubevirt/kubevirtci/pull/972
[3] https://issues.redhat.com/browse/RHEL-1308
[4] https://issues.redhat.com/browse/CNV-33590
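A minimal k3d sketch of a multi-node cluster for such a PoC, assuming a stock k3d binary; the cluster name is hypothetical:

    # 1 server (control-plane) + 2 agents (workers)
    k3d cluster create sriov --servers 1 --agents 2
    kubectl get nodes -o wide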
In case we won't find a kind alternative that works:
- Test basic SR-IOV functionality:
Create a kind cluster with a single node, similar to the VGPU lane.
It means we drop the SR-IOV migration & cross-node connectivity tests in tier1
and rely on D/S tier2 tests.
- Drop SR-IOV tests entirely:
Rely on D/S tier2 tests entirely.
One big con of the above is that we lose feedback on SR-IOV related regressions.