- Epic
- Resolution: Unresolved
- Normal
- None
- None
- no_imageLocaloity_for_VMIs
- 92% To Do, 0% In Progress, 8% Done
- dev-ready, ux-ready
Goal
Currently all of our VMI pods are scheduled with the default scheduler profile (if a pod doesn't specify a scheduler name, kube-apiserver sets it to default-scheduler), which includes the ImageLocality plugin: the ImageLocality plugin favors nodes that already have the container images that the Pod runs.
In the kubevirt case, the virt-launcher image is the same for all the VMIs and it should already be available on all workload nodes, since virt-handler uses it during the node-labelling process.
Due to this, the ImageLocality plugin brings no benefit to VMI scheduling.
On the other hand, due to:
kubernetes/kubernetes#93488
the list of images reported by a node is capped to a certain limit (the default is 50), so some nodes can randomly report that they already have the virt-launcher image while others will not, and this can lead to unbalanced scheduling across nodes.
If we are able to bypass the ImageLocality plugin for our VMI pods, we can achieve a more uniform scheduling of VMIs over cluster nodes.
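A quick way to check whether a given cluster is affected is to compare the (capped) image list that each node reports, e.g. with something like $ oc get node <node-name> -o jsonpath='{.status.images[*].names}' | tr ' ' '\n' | grep virt-launcher (the exact virt-launcher image name depends on the deployment): if the image shows up on some nodes but not on others, ImageLocality will score those nodes differently for the same VMI pod.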
User Stories
- As a Cluster Admin, I want all of my VMIs to be scheduled uniformly across my nodes, avoiding the side effects of the ImageLocality plugin
Non-Requirements
- List of things not included in this epic, to alleviate any doubt raised during the grooming process.
Notes
- see https://bugzilla.redhat.com/show_bug.cgi?id=1984442 which explains the issue with stats from a large cluster
- we are pretty sure that this issue is basically specific to CNV, which is really a corner case for the ImageLocality plugin:
1. CNV uses the same image for all VMs
2. the virt-launcher image exists on all workload nodes, because of virt-handler
3. the virt-launcher image may or may not appear in a node's image list, which is capped to 50 entries (see: kubernetes/kubernetes#93488), and this is random
No other product is likely to need a single image (virt-launcher) used many times for different workloads (VMs).
- in k8s the feature is documented here: https://kubernetes.io/docs/reference/scheduling/config/
We can now define custom scheduler profiles that extend the default one, enabling or disabling specific plugins (in our case the ImageLocality one) with something like:

    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: kubevirt
        plugins:
          score:
            disabled:
              - name: ImageLocality
- in OCP, KubeSchedulerConfiguration is not directly exposed to customers; it can only be used via the OpenShift Secondary Scheduler Operator, see: https://access.redhat.com/solutions/6955993 and https://docs.openshift.com/container-platform/4.12/nodes/scheduling/secondary_scheduler/nodes-secondary-scheduler-configuring.html#nodes-secondary-scheduler-configuring-console_secondary-scheduler-configuring
- this means that HCO should set a dependency on the OpenShift Secondary Scheduler Operator at the OLM level
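As a sketch of that OLM-level dependency (the package name and version range here are assumptions and would need to be confirmed against the catalog), HCO's bundle could declare something like the following in metadata/dependencies.yaml:

    dependencies:
      - type: olm.package
        value:
          # assumed package name of the Secondary Scheduler Operator, to be verified
          packageName: openshift-secondary-scheduler-operator
          # illustrative version range
          version: ">=1.1.0"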
- HCO should configure it with something like:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: "secondary-scheduler-config"
      namespace: "openshift-secondary-scheduler-operator"
    data:
      "config.yaml": |
        apiVersion: kubescheduler.config.k8s.io/v1
        kind: KubeSchedulerConfiguration
        profiles:
          - schedulerName: kubevirt
            plugins:
              score:
                disabled:
                  - name: ImageLocality
and a CR referring to the default OCP scheduler image (we can get it with: $ oc get pods -n openshift-kube-scheduler --selector=app=openshift-kube-scheduler -o json | jq ".items[0].spec.containers[0].image" --raw-output) alongside the customized configuration.
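For illustration, that SecondaryScheduler CR could look roughly like the sketch below; the field names follow the Secondary Scheduler Operator documentation, but the CR name and the image value are placeholders to be confirmed:

    apiVersion: operator.openshift.io/v1
    kind: SecondaryScheduler
    metadata:
      name: cluster   # assumed name, check what the operator expects
      namespace: openshift-secondary-scheduler-operator
    spec:
      managementState: Managed
      # ConfigMap holding the KubeSchedulerConfiguration above
      schedulerConfig: secondary-scheduler-config
      # placeholder: the default OCP scheduler image retrieved with the oc command above
      schedulerImage: <default-ocp-kube-scheduler-image>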
- reported upstream in https://github.com/kubevirt/kubevirt/issues/9570
kubevirt already allows defining VMs with:

    apiVersion: kubevirt.io/v1
    kind: VirtualMachine
    metadata:
      name: vm-fedora
    spec:
      running: true
      template:
        spec:
          schedulerName: kubevirt
and .spec.template.spec.schedulerName is already enough to request a custom scheduler profile. The gap in kubevirt is probably just a mechanism to set a custom default for .spec.template.spec.schedulerName from virt-operator, and we could have a fixed value set there by HCO (the VM owner will still be able to override it VM by VM).
- if we don't want to have a default for .spec.template.spec.schedulerName from virt-operator (set by HCO), we can still define it in our VM templates, but this is not going to affect custom VMs; see the sketch below
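As a purely illustrative sketch (the template name and the minimal VM structure here are made up), the schedulerName could be carried by a common VM template like this:

    apiVersion: template.openshift.io/v1
    kind: Template
    metadata:
      name: example-vm-template   # illustrative name only
    objects:
      - apiVersion: kubevirt.io/v1
        kind: VirtualMachine
        metadata:
          name: ${NAME}
        spec:
          running: false
          template:
            spec:
              # VMs created from this template default to the custom profile;
              # VMs created outside the templates are not affected
              schedulerName: kubevirt
    parameters:
      - name: NAME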
Done Checklist
Who | What | Reference |
---|---|---|
DEV | Upstream roadmap issue (or individual upstream PRs) | <link to GitHub Issue> |
DEV | Upstream documentation merged | <link to meaningful PR> |
DEV | gap doc updated | <name sheet and cell> |
DEV | Upgrade consideration | <link to upgrade-related test or design doc> |
DEV | CEE/PX summary presentation | label epic with cee-training and add a <link to your support-facing preso> |
QE | Test plans in Polarion | <link or reference to Polarion> |
QE | Automated tests merged | <link or reference to automated tests> |
DOC | Downstream documentation merged | <link to meaningful PR> |