CNV-27857 (OpenShift Virtualization)

Skip ImageLocality plugin in VMIs scheduling (setting a dependency on secondary-scheduler-operator)


    • no_imageLocaloity_for_VMIs
    • 92% To Do, 0% In Progress, 8% Done
    • dev-ready, ux-ready

      Goal

      Currently, all of our VMI pods are scheduled with the default scheduler profile (if a pod doesn't specify a scheduler name, kube-apiserver sets it to default-scheduler). That profile includes the ImageLocality plugin, which favors nodes that already have the container images the Pod runs.
      In the KubeVirt case, the virt-launcher image is the same for all VMIs, and it should always already be available on every node, because virt-handler uses it during the node-labelling process.
      As a result, the ImageLocality plugin brings no benefit to VMI scheduling.
      On the other hand, due to:
      kubernetes/kubernetes#93488
      the list of images reported by a node is capped at a certain limit (the default is 50), so some nodes may report that they already have the virt-launcher image while others do not, effectively at random, and this can lead to unbalanced scheduling across nodes.

      If we are able to bypass the ImageLocality plugin for our VMI pods, we can achieve a more uniform scheduling of VMIs over cluster nodes.
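
      For context, the cap mentioned above appears to correspond to the kubelet's nodeStatusMaxImages setting. The fragment below is only a sketch of where that limit lives, assuming a plain upstream KubeletConfiguration file; how kubelet settings are managed on a given cluster may differ, and this epic does not propose changing the value.

      ```yaml
      apiVersion: kubelet.config.k8s.io/v1beta1
      kind: KubeletConfiguration
      # Maximum number of images the kubelet reports in Node.status.images.
      # The upstream default is 50; -1 disables the cap.
      nodeStatusMaxImages: 50
      ```

      Raising or removing the cap would make the reported image lists larger, so bypassing the ImageLocality plugin for VMI pods remains the approach described here.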

      User Stories

      • As a Cluster Admin, I want all of my VMIs to be scheduled uniformly across my nodes, avoiding the side effects of the ImageLocality plugin

      Non-Requirements

      • List of things not included in this epic, to alleviate any doubt raised during the grooming process.

      Notes

      • see https://bugzilla.redhat.com/show_bug.cgi?id=1984442, which explains the issue with stats from a large cluster
      • we are fairly sure that this issue is essentially specific to CNV, which is a real corner case for the ImageLocality plugin:
        1. CNV uses the same image for all VMs
        2. the virt-launcher image exists on all workload nodes, because of virt-handler
        3. the virt-launcher image may or may not appear in the node's image list, which is capped at 50 entries (see: kubernetes/kubernetes#93488), and this is random
        Probably no other product needs to use a single image (virt-launcher) many times for different workloads (VMs).
      • in k8s the feature is documented here: https://kubernetes.io/docs/reference/scheduling/config/
        We can now define custom scheduler profiles that extend the default one by enabling or disabling specific plugins, in our case the ImageLocality one, with something like:
        apiVersion: kubescheduler.config.k8s.io/v1
        kind: KubeSchedulerConfiguration
        profiles: 
          - schedulerName: kubevirt
            plugins: 
              score: 
                disabled: 
                - name: ImageLocality
        
      • reported upstream in https://github.com/kubevirt/kubevirt/issues/9570
        KubeVirt already allows defining VMs with:
        apiVersion: kubevirt.io/v1
        kind: VirtualMachine
        metadata: 
          name: vm-fedora
        spec: 
          running: true
          template: 
            spec: 
              schedulerName: kubevirt
        

        and .spec.template.spec.schedulerName is already enough to select a custom scheduler profile. The gap in KubeVirt is probably just a mechanism for virt-operator to set a custom default for .spec.template.spec.schedulerName; HCO could then set a fixed value there (the VM owner would still be able to override it VM by VM).

      • if we don't want to have a default for .spec.template.spec.schedulerName from virt-operator (set by HCO), we can still define it in our VM templates, but this will not affect custom VMs
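
      The per-VM override mentioned in the notes above could look like the fragment below. This is a sketch only: the VM name is hypothetical, the rest of the VM spec is omitted as in the earlier example, and it assumes that an explicitly set schedulerName would take precedence over any default that virt-operator or a template provides.

      ```yaml
      apiVersion: kubevirt.io/v1
      kind: VirtualMachine
      metadata:
        name: vm-default-scheduler   # hypothetical name
      spec:
        running: true
        template:
          spec:
            # Explicit per-VM value, opting this VM back into the default
            # profile (which still includes the ImageLocality plugin).
            schedulerName: default-scheduler
      ```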

      Done Checklist

      Who  What                                                 Reference
      DEV  Upstream roadmap issue (or individual upstream PRs)  <link to GitHub Issue>
      DEV  Upstream documentation merged                        <link to meaningful PR>
      DEV  Gap doc updated                                      <name sheet and cell>
      DEV  Upgrade consideration                                <link to upgrade-related test or design doc>
      DEV  CEE/PX summary presentation                          label epic with cee-training and add a <link to your support-facing preso>
      QE   Test plans in Polarion                               <link or reference to Polarion>
      QE   Automated tests merged                               <link or reference to automated tests>
      DOC  Downstream documentation merged                      <link to meaningful PR>
