  1. OpenShift Bugs
  2. OCPBUGS-5969

[Nutanix]No host has enough available memory for VM, machine stuck in Provisioning and machineset scale/delete cannot delete machines

    • None
    • Rejected
    • False
    • Release Note Not Required
    • In Progress

      Description of problem:

      When no Nutanix host has enough available memory for a new VM, the machine stays stuck in Provisioning, and scaling down or deleting a machineset does not remove its machines.

      Version-Release number of selected component (if applicable):

      Server Version: 
      4.12.0
      4.13.0-0.nightly-2023-01-17-152326

      How reproducible:

      Always

      Steps to Reproduce:

      1. Install a Nutanix cluster using the following template and settings:
      Template https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/tree/master/functionality-testing/aos-4_12/ipi-on-nutanix//versioned-installer
      master_num_memory: 32768
      worker_num_memory: 16384
      networkType: "OVNKubernetes"
      installer_payload_image: quay.io/openshift-release-dev/ocp-release:4.12.0-x86_64
      2. Scale up the cluster worker machineset from 2 replicas to 40 replicas.
      3. Create an infra machineset with 3 replicas and a workload machineset with 1 replica.
      Refer to https://docs.openshift.com/container-platform/4.11/machine_management/creating-infrastructure-machinesets.html#machineset-yaml-nutanix_creating-infrastructure-machinesets and configure the following resources:
      VCPU=16
      MEMORYMB=65536
      MEMORYSIZE=64Gi
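
The two memory settings above are meant to describe the same amount in different units. As a minimal sketch of that arithmetic (the helper names `quantity_to_mib` and `memory_settings_agree` are hypothetical and not part of any OpenShift or Nutanix tooling, and the sketch assumes MEMORYMB counts mebibytes), the following checks that 64Gi corresponds to 65536:

```python
import re

# Binary (power-of-two) suffixes, expressed in bytes.
_BINARY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def quantity_to_mib(quantity: str) -> int:
    """Convert a quantity such as '64Gi' to whole mebibytes (rounded down)."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)", quantity.strip())
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value * _BINARY_UNITS[unit] // _BINARY_UNITS["Mi"]


def memory_settings_agree(memory_mb: int, memory_size: str) -> bool:
    """Return True when MEMORYMB and MEMORYSIZE describe the same amount."""
    return memory_mb == quantity_to_mib(memory_size)


if __name__ == "__main__":
    # Values taken from the reproduction steps in this report.
    print(quantity_to_mib("64Gi"))
    print(memory_settings_agree(65536, "64Gi"))
```

With the values from this report, both settings request 65536 MiB per infra machine.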

      Actual results:

      1. The new infra machines stayed stuck in the 'Provisioning' phase for about 3 hours.
      
      % oc get machines -A | grep Prov                                               
      openshift-machine-api   qili-nut-big-jh468-infra-48mdt      Provisioning                                      175m
      openshift-machine-api   qili-nut-big-jh468-infra-jnznv      Provisioning                                      175m
      openshift-machine-api   qili-nut-big-jh468-infra-xp7xb      Provisioning                                      175m
      
      2. In the Nutanix web console, infra machine 'qili-nut-big-jh468-infra-48mdt' showed the following message:
      "
      No host has enough available memory for VM qili-nut-big-jh468-infra-48mdt (8d7eb6d6-a71e-4943-943a-397596f30db2) that uses 4 vCPUs and 65536MB of memory. You could try downsizing the VM, increasing host memory, power off some VMs, or moving the VM to a different host. Maximum allowable VM size is approximately 17921 MB
      "
      
      Infra machine 'qili-nut-big-jh468-infra-jnznv' is not found.
      
      Infra machine 'qili-nut-big-jh468-infra-xp7xb' is shown in green with no warning,
      but the must-gather contains the following error event:
      03:23:49  openshift-machine-api  nutanixcontroller  qili-nut-big-jh468-infra-xp7xb  FailedCreate  qili-nut-big-jh468-infra-xp7xb: reconciler failed to Create machine: failed to update machine with vm state: qili-nut-big-jh468-infra-xp7xb: failed to get node qili-nut-big-jh468-infra-xp7xb: Node "qili-nut-big-jh468-infra-xp7xb" not found
      
      3. Scaling down the worker machineset from 40 replicas to 30 replicas does not take effect. There are still 40 Running worker machines and 40 Ready nodes after about 3 hours.
      
      % oc get machinesets -A
      NAMESPACE               NAME                          DESIRED   CURRENT   READY   AVAILABLE   AGE
      openshift-machine-api   qili-nut-big-jh468-infra      3         3                             176m
      openshift-machine-api   qili-nut-big-jh468-worker     30        30        30      30          5h1m
      openshift-machine-api   qili-nut-big-jh468-workload   1         1                             176m
      
      % oc get machines -A | grep worker| grep Running -c
      40
      
      % oc get nodes | grep worker | grep Ready -c
      40
      
      4. I deleted the infra machineset, but its machines remain in the Provisioning phase and are not deleted.
      
      % oc delete machineset -n openshift-machine-api   qili-nut-big-jh468-infra
      machineset.machine.openshift.io "qili-nut-big-jh468-infra" deleted
      
      % oc get machinesets -A
      NAMESPACE               NAME                          DESIRED   CURRENT   READY   AVAILABLE   AGE
      openshift-machine-api   qili-nut-big-jh468-worker     30        30        30      30          5h26m
      openshift-machine-api   qili-nut-big-jh468-workload   1         1                             3h21m
      
      % oc get machines -A | grep -v Running
      NAMESPACE               NAME                                PHASE          TYPE   REGION    ZONE              AGE
      openshift-machine-api   qili-nut-big-jh468-infra-48mdt      Provisioning                                      3h22m
      openshift-machine-api   qili-nut-big-jh468-infra-jnznv      Provisioning                                      3h22m
      openshift-machine-api   qili-nut-big-jh468-infra-xp7xb      Provisioning                                      3h22m
      openshift-machine-api   qili-nut-big-jh468-workload-qdkvd                                                     3h22m
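
The Nutanix console message quoted in item 2 already contains both the requested memory and the largest VM size a host could accept. The following is a rough sketch of how that gap could be extracted for triage. The regular expression is written against the single message quoted in this report and is an assumption about its wording, not a documented Nutanix format; `MemoryShortfall` and `parse_memory_alert` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Message text copied from the Nutanix web console output quoted in this report.
EXAMPLE_MESSAGE = (
    "No host has enough available memory for VM qili-nut-big-jh468-infra-48mdt "
    "(8d7eb6d6-a71e-4943-943a-397596f30db2) that uses 4 vCPUs and 65536MB of memory. "
    "You could try downsizing the VM, increasing host memory, power off some VMs, "
    "or moving the VM to a different host. "
    "Maximum allowable VM size is approximately 17921 MB"
)

# Assumed wording, inferred from the one message above.
_ALERT_PATTERN = re.compile(
    r"memory for VM (?P<vm>\S+) .*?"
    r"(?P<requested>\d+)\s*MB of memory.*?"
    r"approximately (?P<maximum>\d+)\s*MB",
    re.DOTALL,
)


@dataclass(frozen=True)
class MemoryShortfall:
    vm_name: str
    requested_mb: int
    max_allowable_mb: int

    @property
    def excess_mb(self) -> int:
        """How far the request exceeds the largest VM a host can accept."""
        return self.requested_mb - self.max_allowable_mb


def parse_memory_alert(message: str) -> Optional[MemoryShortfall]:
    """Extract the VM name and memory figures, or None if the text differs."""
    match = _ALERT_PATTERN.search(message)
    if match is None:
        return None
    return MemoryShortfall(
        vm_name=match.group("vm"),
        requested_mb=int(match.group("requested")),
        max_allowable_mb=int(match.group("maximum")),
    )


if __name__ == "__main__":
    shortfall = parse_memory_alert(EXAMPLE_MESSAGE)
    print(shortfall, shortfall.excess_mb)
```

For the quoted message, the request of 65536 MB exceeds the reported maximum of 17921 MB by 47615 MB, so the VM cannot be placed on any host until memory is freed or the request is reduced.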

      Expected results:

      The new infra machines should end up in either the Running or the Failed phase.
      Scaling the cluster worker machineset up and down should not be affected.
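
The expected results can be turned into a simple check over the `oc get machines -A` listing. The sketch below is illustrative only: it parses the plain-text table by whitespace, which is fragile because blank columns disappear, and it assumes the set of phase names listed in `KNOWN_PHASES`. The function names are hypothetical.

```python
from typing import List, Tuple

# Assumed set of machine phase names; used to tell a phase from the AGE column
# when the PHASE column is blank.
KNOWN_PHASES = {"Provisioning", "Provisioned", "Running", "Deleting", "Failed"}
# Per the expected results, a machine should settle in one of these phases.
ACCEPTABLE_PHASES = {"Running", "Failed"}

# Listing copied from the actual results in this report.
SAMPLE_LISTING = """\
NAMESPACE               NAME                                PHASE          TYPE   REGION    ZONE              AGE
openshift-machine-api   qili-nut-big-jh468-infra-48mdt      Provisioning                                      3h22m
openshift-machine-api   qili-nut-big-jh468-infra-jnznv      Provisioning                                      3h22m
openshift-machine-api   qili-nut-big-jh468-infra-xp7xb      Provisioning                                      3h22m
openshift-machine-api   qili-nut-big-jh468-workload-qdkvd                                                     3h22m
"""


def parse_machines(listing: str) -> List[Tuple[str, str]]:
    """Return (name, phase) pairs; phase is '' when the PHASE column is blank."""
    machines = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "NAMESPACE":
            continue  # skip the header row and empty lines
        phase = fields[2] if fields[2] in KNOWN_PHASES else ""
        machines.append((fields[1], phase))
    return machines


def unexpected_machines(listing: str) -> List[Tuple[str, str]]:
    """Machines that are neither Running nor Failed."""
    return [(n, p) for n, p in parse_machines(listing) if p not in ACCEPTABLE_PHASES]


def replica_drift(desired: int, running: int) -> int:
    """Running machines beyond the desired replica count (negative if short)."""
    return running - desired


if __name__ == "__main__":
    for name, phase in unexpected_machines(SAMPLE_LISTING):
        print(name, phase or "<no phase>")
    # 30 desired and 40 Running workers, as reported in item 3.
    print(replica_drift(desired=30, running=40))
```

Against the listing in this report, the check flags the three infra machines in Provisioning and the workload machine with no phase, and reports 10 surplus worker machines.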

      Additional info:

      The must-gather download URL will be added in a comment.

              yanhli@redhat.com Yanhua Li
              rhn-support-qili Qiujie Li
              Huali Liu Huali Liu