Goal

Currently, all of our VMI pods are scheduled with the default scheduler profile (if a pod doesn't specify a scheduler name, kube-apiserver sets it to default-scheduler), which includes the ImageLocality plugin. The ImageLocality plugin favors nodes that already have the container images that the Pod runs.
In the KubeVirt case, the virt-launcher image is the same for all VMIs, and it should always be available already on all nodes, since virt-handler uses it in the node labelling process.
Because of this, the ImageLocality plugin provides no benefit for VMI scheduling.
On the other hand, due to kubernetes/kubernetes#93488, the list of images reported by a node is capped at a certain limit (the default is 50), so some nodes can randomly report that they already have the virt-launcher image while others do not, and this can lead to unbalanced scheduling across nodes.
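
For reference, the cap mentioned above corresponds to the kubelet setting nodeStatusMaxImages. The fragment below is a minimal sketch that only shows the upstream default value; how kubelet configuration is delivered to the nodes of a given cluster is an assumption left out here.
```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Maximum number of images the kubelet reports in node.status.images.
# 50 is the upstream default; images beyond the cap are not reported,
# so ImageLocality cannot see them on that node.
nodeStatusMaxImages: 50
```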

If we are able to bypass the ImageLocality plugin for our VMI pods, we can achieve a more uniform scheduling of VMIs over cluster nodes.

User Stories

As a Cluster Admin, I want all of my VMIs to be scheduled uniformly across my nodes, avoiding the side effects of the ImageLocality plugin.

Non-Requirements

List of things not included in this epic, to alleviate any doubt raised during the grooming process.

Notes

See https://bugzilla.redhat.com/show_bug.cgi?id=1984442, which explains the issue with stats from a large cluster.
We are pretty sure that this issue is basically specific to CNV, which is really a corner case for the ImageLocality plugin:
1. CNV uses the same image for all VMs
2. the virt-launcher image exists on all workload nodes, because of virt-handler
3. the virt-launcher image may or may not appear in the node's image list, which is capped at 50 entries (see: kubernetes/kubernetes#93488), and this is random
Probably no other product needs to use a single image (virt-launcher) many times for different workloads (VMs).

In k8s, the feature is documented here: https://kubernetes.io/docs/reference/scheduling/config/
We can now define custom scheduler profiles that extend the default one by enabling or disabling specific plugins. In our case we would disable the ImageLocality score plugin with something like:
```
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles: 
  - schedulerName: kubevirt
    plugins: 
      score: 
        disabled: 
        - name: ImageLocality
```

In OCP, KubeSchedulerConfiguration is not directly exposed to customers; it can be used only via the OpenShift Secondary Scheduler Operator, see: https://access.redhat.com/solutions/6955993 and https://docs.openshift.com/container-platform/4.12/nodes/scheduling/secondary_scheduler/nodes-secondary-scheduler-configuring.html#nodes-secondary-scheduler-configuring-console_secondary-scheduler-configuring
- this means that HCO should set a dependency on the OpenShift Secondary Scheduler Operator at the OLM level
- HCO should configure it with something like:
```
apiVersion: v1
kind: ConfigMap
metadata: 
  name: "secondary-scheduler-config"                  
  namespace: "openshift-secondary-scheduler-operator" 
data: 
  "config.yaml": |
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration                  
    profiles: 
      - schedulerName: kubevirt
        plugins:                                      
          score: 
            disabled: 
              - name: ImageLocality
```
  and a CR referring to the default OCP scheduler image (we can get it with: $ oc get pods -n openshift-kube-scheduler --selector=app=openshift-kube-scheduler -o json | jq ".items[0].spec.containers[0].image" --raw-output) alongside the customized configuration.
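
A sketch of what that CR could look like, based on the SecondaryScheduler resource described in the OCP documentation linked above. The schedulerImage value is a placeholder to be filled with the output of the oc command; the exact fields should be checked against the operator version in use.
```yaml
apiVersion: operator.openshift.io/v1
kind: SecondaryScheduler
metadata:
  name: cluster
  namespace: openshift-secondary-scheduler-operator
spec:
  managementState: Managed
  # Name of the ConfigMap holding the KubeSchedulerConfiguration shown above
  schedulerConfig: secondary-scheduler-config
  # Placeholder: the default OCP scheduler image returned by the oc command
  schedulerImage: "<default OCP scheduler image>"
```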

Reported upstream in https://github.com/kubevirt/kubevirt/issues/9570
KubeVirt already allows defining VMs with:
```
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata: 
  name: vm-fedora
spec: 
  running: true
  template: 
    spec: 
      schedulerName: kubevirt
```
and .spec.template.spec.schedulerName is already enough to request a custom scheduler profile. The gap in KubeVirt is probably just a mechanism to set a custom default for .spec.template.spec.schedulerName from virt-operator, so that we could have a fixed value set there by HCO (the VM owner will still be able to override it VM by VM).
If we don't want to have a default for .spec.template.spec.schedulerName from virt-operator (set by HCO), we can still define it in our VM templates, but this is not going to affect custom VMs.
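
As an illustration of the per-VM override mentioned above: assuming a cluster-wide default of kubevirt were injected (that mechanism does not exist yet), a VM owner could pin a single VM back to the stock profile by setting the field explicitly. The VM name is hypothetical, and the precedence of the explicit value over the default is the assumed behavior, not an implemented one.
```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-fedora-default-sched   # hypothetical name
spec:
  running: true
  template:
    spec:
      # Explicit per-VM value; assumed to take precedence over any
      # default schedulerName set cluster-wide by HCO/virt-operator.
      schedulerName: default-scheduler
```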

Done Checklist

Who	What	Reference
DEV	Upstream roadmap issue (or individual upstream PRs)	<link to GitHub Issue>
DEV	Upstream documentation merged	<link to meaningful PR>
DEV	gap doc updated	<name sheet and cell>
DEV	Upgrade consideration	<link to upgrade-related test or design doc>
DEV	CEE/PX summary presentation	label epic with cee-training and add a <link to your support-facing preso>
QE	Test plans in Polarion	<link or reference to Polarion>
QE	Automated tests merged	<link or reference to automated tests>
DOC	Downstream documentation merged	<link to meaningful PR>

depends on

CNV-21642 Document custom scheduler for VMs

Closed

is related to

CNV-25477 [ contd ] Add an alert when pod scheduling might be imbalanced across nodes duo to too much images (review & merge)

Closed

relates to

CNV-21571 Report Alert on too many images

Closed

links to

Upstream link for fix ImageLocality plugin (if fixed, the epic can be obsoleted)
