OpenShift Bugs / OCPBUGS-11891

Descheduling OpenShift Virtualization VMs using LowNodeUtilization results in unstable behavior


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Critical
    • Affects Version: 4.12
    • Component: descheduler
    • Severity: Important
    • Sprint: Workloads Sprint 259

      Description of problem:

      Using the descheduler operator with the LowNodeUtilization strategy to migrate VMs results in unstable, oscillatory behavior.
      
      When upgrading a cluster, or modifying node properties, the descheduler with LowNodeUtilization is necessary to achieve a reasonable balance of long-lived pods (including VMs) after the upgrade; otherwise at least one node ends up essentially empty.  I was investigating appropriate descheduler settings for this operation and found that every combination of settings I tried resulted in essentially undamped oscillations, with VMs migrating from node to node.
      
      VMs consist of three components:
      * A VM object, created by the user
      * A virt-launcher pod, created on behalf of the VM, in which the guest runs
      * A virtual machine instance (VMI) object that actually represents the virtual machine.
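
      For reference, a minimal VirtualMachine sketch (the name and sizes are placeholders) showing where the relevant knobs live: the descheduler.alpha.kubernetes.io/evict annotation on the pod template is what opts the virt-launcher pod in to descheduling (used in the reproduction below), and evictionStrategy: LiveMigrate is what turns an eviction into a live migration:

      apiVersion: kubevirt.io/v1
      kind: VirtualMachine
      metadata:
        name: example-vm                  # placeholder name
      spec:
        running: true
        template:
          metadata:
            annotations:
              # opt the virt-launcher pod in to descheduler eviction
              descheduler.alpha.kubernetes.io/evict: "true"
          spec:
            # an eviction triggers a live migration instead of a shutdown
            evictionStrategy: LiveMigrate
            domain:
              devices: {}
              resources:
                requests:
                  cpu: 10m
                  memory: 8G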
      
      Unlike conventional pods, which are destroyed and re-created when evicted, VMs are by default live migrated: the contents of the running guest are copied to a new virt-launcher pod, and the old one is destroyed when the copy completes.  The virt-launcher pod is protected by a PDB that ensures it will not be destroyed by eviction.  When the descheduler (or a node drain, or anything else) attempts to evict a virt-launcher pod, the PDB blocks the eviction.  However, the eviction attempt is noticed, and a new virt-launcher pod is created (normally on a different node) to serve as the live-migration target; when the live migration completes, the old pod is destroyed.
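
      Loosely, the budget virt-controller maintains for a migratable VMI has the shape below (the name, label, and exact fields are my approximation, not taken from a live cluster).  With a single running virt-launcher pod and minAvailable: 1, the eviction API can never actually remove the pod:

      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        name: kubevirt-disruption-budget-example   # illustrative name only
      spec:
        # one pod exists and one must stay available, so eviction is always refused
        minAvailable: 1
        selector:
          matchLabels:
            kubevirt.io/created-by: "<uid-of-the-vmi>"   # assumption: the label virt-controller uses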
      
      The evidence I've collected indicates that the descheduler sees the failed eviction and simply moves on to the next pod.  Since every virt-launcher pod initially fails the eviction attempt, the descheduler ends up triggering eviction on all of them, even if maxNoOfPodsPerNode is set to 1 and regardless of the resource reservation (of memory, pods, or CPU).  The virt-launcher pods are eventually all migrated away, and the node is emptied.  This typically pushes at least one other node over the threshold, and the cycle repeats.

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      Create VMs that are descheduler-enabled (they aren't by default) with enough imbalance between nodes that at least one node is above the upper threshold and at least one is below the lower threshold.  This normally sends the number of VMs per node into oscillation.

      Steps to Reproduce:

      1. git clone https://github.com/RobertKrawitz/OpenShift4-tools.git
      1a. git checkout clusterbuster-omnibus-20230407 (until this PR is merged)
      2. Install the descheduler using the DevPreviewLongLifecycle profile and an appropriate descheduling interval (I usually use 60 seconds for convenience, but even much longer intervals don't affect the outcome); a sketch of the CR is shown after the steps.
      3. Run (from top level in OpenShift4-tools) clusterbuster -f vmtest --requests=cpu=10m --requests=memory=8G --replicas=40 --wait-forever --pod-annotation='descheduler.alpha.kubernetes.io/evict: "true"' --remove-namespaces=0 --vm-migrate=1 --sync-in-first-namespace
      4. Drain one node (oc adm drain <node> --delete-emptydir-data --ignore-daemonsets)
      5. Uncordon the node.
      6. Monitor the descheduler log and/or run from top level
      monitor-cluster-resources -P '*clusterbuster*' -R memory -i 5 -t -c -T -Q
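
      For step 2, the KubeDescheduler CR looks roughly like the following (a minimal sketch; field names are from the cluster-kube-descheduler-operator API as I understand it, and the 60-second interval is the one mentioned in step 2):

      apiVersion: operator.openshift.io/v1
      kind: KubeDescheduler
      metadata:
        name: cluster
        namespace: openshift-kube-descheduler-operator
      spec:
        # profile that enables LowNodeUtilization for long-lived pods
        profiles:
          - DevPreviewLongLifecycle
        # how often the descheduler runs; 60 seconds for convenience
        deschedulingIntervalSeconds: 60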
      
      For step 3, the memory request and the replicas need to be chosen such that the total memory request will exceed 50% of the cluster capacity with a node cordoned.  In this example, I have four nodes each with 192 GB RAM, for a total of 768 GB, and 40 VMs of size 8 GB (320 GB total); this results in the cluster at about 40% capacity initially.  When I drain a node, I'm left with 576 GB; each node is then around 55% of capacity.  Uncordoning the node results in one node essentially empty (below the 20% lower threshold) and the other nodes above the 50% upper threshold.
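
      Schematically, those 20%/50% thresholds correspond to the LowNodeUtilization arguments in the descheduler policy; a sketch in the upstream v1alpha1 policy format (the exact format the operator generates internally is an assumption on my part):

      apiVersion: descheduler/v1alpha1
      kind: DeschedulerPolicy
      strategies:
        LowNodeUtilization:
          enabled: true
          params:
            nodeResourceUtilizationThresholds:
              # nodes below all of these are underutilized (the 20% lower threshold)
              thresholds:
                cpu: 20
                memory: 20
                pods: 20
              # nodes above any of these are overutilized (the 50% upper threshold)
              targetThresholds:
                cpu: 50
                memory: 50
                pods: 50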
      
      This works with other thresholds; the number and size of VMs simply has to be set accordingly.
      
      With conventional pods in a replicaset, the cluster achieves a reasonable state of balance.  If I apply a PDB to the replicaset with a maxUnavailable of 1, the descheduler takes longer to achieve balance since many of the evictions will fail, but the cluster does not go into oscillation.
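
      That comparison uses an ordinary budget over the replicaset's pods; a minimal sketch (the name and label are placeholders for whatever the replicaset's pods actually carry):

      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        name: clusterbuster-pdb            # placeholder name
      spec:
        # at most one pod from the replicaset may be disrupted at a time
        maxUnavailable: 1
        selector:
          matchLabels:
            app: clusterbuster             # placeholder label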

      Actual results:

      Cluster goes into oscillations that do not damp out.

      Expected results:

      Cluster reaches a stable balance, as it does with conventional pods.

      Additional info:
      Descheduler OCP+V analysis by robertkrawitz https://docs.google.com/document/d/1eYJplovpZCmaDOAaEcPp-_NzlyLKvyuQ-tBSEllwn6o/edit

       

            Assignee: Jan Chaloupka (jchaloup@redhat.com)
            Reporter: Robert Krawitz (robertkrawitz)
            Roni Kishner