OpenShift Virtualization / CNV-68835

virt-launcher pods go NotReady during high scale node density testing



      Description of problem:

While testing 4.20 I can occasionally reproduce a state under high node density where a handful of virt-launcher pods go from Running to NotReady. In these cases the NotReady pods are always co-located on the same worker node (although the specific node can change from test to test), and the VMIs report the generic 10.0.2.2 IP. However, the console shows the VM is up and otherwise running fine; it can ping out without issue, etc.
Note this reproduces even when running the 4.19 crio version, so it is likely not related to the crio change investigated in OCPBUGS-60605.
      
      # oc get pod -n virt-density -o wide | grep NotReady
      virt-launcher-virt-density-192-w6dfx   2/3     NotReady   0          3h23m   10.131.0.91    worker00   <none>           1/1
      virt-launcher-virt-density-207-vp2ft   2/3     NotReady   0          3h23m   10.131.0.96    worker00   <none>           1/1
      virt-launcher-virt-density-240-tswg4   2/3     NotReady   0          3h23m   10.131.0.108   worker00   <none>           1/1
      virt-launcher-virt-density-279-v7wh9   2/3     NotReady   0          3h22m   10.131.0.120   worker00   <none>           1/1
      virt-launcher-virt-density-297-s49xh   2/3     NotReady   0          3h22m   10.131.0.127   worker00   <none>           1/1
      virt-launcher-virt-density-313-qtfn2   2/3     NotReady   0          3h22m   10.131.0.132   worker00   <none>           1/1
      
      # oc get vmi -A | grep False
      virt-density   virt-density-192   3h23m   Running   10.0.2.2       worker00   False
      virt-density   virt-density-207   3h23m   Running   10.0.2.2       worker00   False
      virt-density   virt-density-240   3h23m   Running   10.0.2.2       worker00   False
      virt-density   virt-density-279   3h23m   Running   10.0.2.2       worker00   False
      virt-density   virt-density-297   3h23m   Running   10.0.2.2       worker00   False
      virt-density   virt-density-313   3h23m   Running   10.0.2.2       worker00   False
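To spot this state quickly, the two greps above can be wrapped in small stdin-driven helpers (a sketch; the column positions assume the default `oc get ... -o wide --no-headers` layout shown in the outputs above):

```shell
# Reads `oc get pod -o wide --no-headers` output on stdin and counts
# NotReady pods per node ($3 = status, $7 = node name), to confirm the
# affected launcher pods are co-located on a single worker.
count_notready_by_node() {
  awk '$3 == "NotReady" { count[$7]++ } END { for (n in count) print n, count[n] }'
}

# Reads `oc get vmi --no-headers` output on stdin and prints VMIs stuck
# on the placeholder 10.0.2.2 address ($4 = reported IP, $5 = node).
vmis_with_placeholder_ip() {
  awk '$4 == "10.0.2.2" { print $1, $5 }'
}
```

Usage against the cluster would look like `oc get pod -n virt-density -o wide --no-headers | count_notready_by_node` and `oc get vmi -n virt-density --no-headers | vmis_with_placeholder_ip`.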
      
Note we do have one known NotReady scenario during 4.19 mass migration testing, tracked in CNV-67948, but it is not clear yet whether it is related.

      Version-Release number of selected component (if applicable):

      OCP 4.20.0-ec.6, Virt 4.20.0-144

      How reproducible:

Some runs of 200 VMs per node are successful; others hit this NotReady error, usually for only ~5 pods each time in this environment.

      Steps to Reproduce:

1. Start 200 VMs per node at once and check pod states
      
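The workload can be sketched as a generator emitting minimal VirtualMachine manifests to apply in one shot. The manifest shape (kubevirt.io/v1 with a cirros containerDisk) and the namespace are assumptions for illustration; the actual test harness may differ:

```shell
# Emit N minimal VirtualMachine manifests on stdout so they can be
# applied at once with `emit_vms 200 | oc apply -f -`.
emit_vms() {
  count=$1
  i=1
  while [ "$i" -le "$count" ]; do
    cat <<EOF
---
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: virt-density-$i
  namespace: virt-density
spec:
  runStrategy: Always
  template:
    spec:
      domain:
        devices:
          disks:
          - name: rootdisk
            disk:
              bus: virtio
        memory:
          guest: 128Mi
      volumes:
      - name: rootdisk
        containerDisk:
          image: quay.io/kubevirt/cirros-container-disk-demo
EOF
    i=$((i + 1))
  done
}
```

After the VMs start, pod state can be checked with `oc get pod -n virt-density -o wide | grep -c NotReady`.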

      Actual results:

      Not all pods stay in Running state

      Expected results:

      All pods stay in Running state

      Additional info:

The affected pods flip from 3/3 Running to 2/3 NotReady within the first few minutes after start:
      virt-density                                       virt-launcher-virt-density-313-qtfn2                              3/3     Running                     0             2m24s   10.131.0.132     worker00   <none>           1/1
      
      virt-density                                       virt-launcher-virt-density-313-qtfn2                              2/3     NotReady                    0             3m23s   10.131.0.132     worker00   <none>           1/1
      
      
      Worker00 is not too loaded in terms of resources:
      
        Resource                       Requests            Limits
        --------                       --------            ------
        cpu                            22346m (17%)        23125m (18%)
        memory                         195895318784 (76%)  20100M (7%)
      
      # oc adm top node
      NAME       CPU(cores)   CPU(%)   MEMORY(bytes)   MEMORY(%)
      master-0   1335m        6%       11285Mi         23%
      master-1   4938m        25%      16352Mi         33%
      master-2   1367m        7%       13858Mi         28%
      worker00   2358m        1%       148680Mi        60%
      worker01   8426m        6%       154793Mi        63%
      worker02   4399m        3%       153805Mi        62%
      
I tried rebooting the guest OS and the pod state did not change. Interestingly, a virtctl migrate test on a VM worked, and afterwards the pod state was fine and remained Running:
      
      virt-density                                       virt-launcher-virt-density-313-ncmm7                              3/3     Running     0               56s     10.128.2.238     worker02   <none>           1/1
      virt-density                                       virt-launcher-virt-density-313-qtfn2                              0/3     Completed   0               3h27m   10.131.0.132     worker00   <none>           1/1
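Since a live migration cleared the state for that one VM, a possible bulk workaround is to migrate every affected VMI. A sketch that reads `oc get vmi -n virt-density --no-headers` output on stdin and prints the `virtctl migrate` command for each stuck VMI (it echoes the commands rather than running them, so it can be reviewed as a dry run first):

```shell
# Select VMIs stuck on the placeholder 10.0.2.2 address ($4 = IP) and
# print a `virtctl migrate` command for each; drop the `echo` to execute.
migrate_stuck_vmis() {
  awk '$4 == "10.0.2.2" { print $1 }' | while read -r vmi; do
    echo virtctl migrate -n virt-density "$vmi"
  done
}
```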

              ffossemo@redhat.com Federico Fossemo
              jhopper@redhat.com Jenifer Abrams
              Denys Shchedrivyi Denys Shchedrivyi