Bug
Resolution: Unresolved
Priority: Undefined
Affects Version: CNV v4.20.0
Quality / Stability / Reliability
Description of problem:
While testing 4.20 under high node densities I can occasionally reproduce a state where a handful of virt-launcher pods go from Running to NotReady. In these cases the NotReady pods are always co-located on the same worker node (although the specific node can change from test to test) and the VMIs report the generic 10.0.2.2 IP. A console shows the VM is up and otherwise running fine, it can ping out without issue, etc. Note this reproduces even when running the 4.19 crio version, so it is likely not related to the crio change investigated in OCPBUGS-60605.

# oc get pod -n virt-density -o wide | grep NotReady
virt-launcher-virt-density-192-w6dfx   2/3   NotReady   0   3h23m   10.131.0.91    worker00   <none>   1/1
virt-launcher-virt-density-207-vp2ft   2/3   NotReady   0   3h23m   10.131.0.96    worker00   <none>   1/1
virt-launcher-virt-density-240-tswg4   2/3   NotReady   0   3h23m   10.131.0.108   worker00   <none>   1/1
virt-launcher-virt-density-279-v7wh9   2/3   NotReady   0   3h22m   10.131.0.120   worker00   <none>   1/1
virt-launcher-virt-density-297-s49xh   2/3   NotReady   0   3h22m   10.131.0.127   worker00   <none>   1/1
virt-launcher-virt-density-313-qtfn2   2/3   NotReady   0   3h22m   10.131.0.132   worker00   <none>   1/1

# oc get vmi -A | grep False
virt-density   virt-density-192   3h23m   Running   10.0.2.2   worker00   False
virt-density   virt-density-207   3h23m   Running   10.0.2.2   worker00   False
virt-density   virt-density-240   3h23m   Running   10.0.2.2   worker00   False
virt-density   virt-density-279   3h23m   Running   10.0.2.2   worker00   False
virt-density   virt-density-297   3h23m   Running   10.0.2.2   worker00   False
virt-density   virt-density-313   3h23m   Running   10.0.2.2   worker00   False

Note we do have one known NotReady scenario during 4.19 mass migration testing, tracked in CNV-67948, but it is not yet clear whether it is related.
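For triage, one way to check which of the three virt-launcher containers is failing its readiness check on an affected pod is sketched below (the pod name is taken from the output above; the exact jsonpath expression is illustrative and not part of the original report):

# oc describe pod -n virt-density virt-launcher-virt-density-192-w6dfx
# oc get pod -n virt-density virt-launcher-virt-density-192-w6dfx -o jsonpath='{range .status.containerStatuses[*]}{.name}{": ready="}{.ready}{"\n"}{end}'

The first command shows the pod Conditions and any readiness-probe events; the second prints a per-container ready flag.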
Version-Release number of selected component (if applicable):
OCP 4.20.0-ec.6, Virt 4.20.0-144
How reproducible:
Some runs of 200 VMs per node are successful; other runs hit this NotReady error, usually for only ~5 pods at a time in this environment.
Steps to Reproduce:
1. Start 200 VMs per node at once, then check pod states (see the sketch below).
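A minimal sketch of one way to drive this load, assuming a VirtualMachine manifest template (a hypothetical vm-template.yaml with a NAME placeholder) and the virt-density namespace; the actual density test harness used for these runs is not shown in this report:

# for i in $(seq 1 200); do sed "s/NAME/virt-density-${i}/" vm-template.yaml | oc apply -n virt-density -f -; done
# oc get pod -n virt-density -o wide | grep -c NotReady

The second command counts how many virt-launcher pods have dropped out of Ready once the VMs settle.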
Actual results:
Not all pods stay in Running state
Expected results:
All pods stay in Running state
Additional info:
An affected pod transitions from Running to NotReady shortly after startup:

virt-density   virt-launcher-virt-density-313-qtfn2   3/3   Running    0   2m24s   10.131.0.132   worker00   <none>   1/1
virt-density   virt-launcher-virt-density-313-qtfn2   2/3   NotReady   0   3m23s   10.131.0.132   worker00   <none>   1/1

Worker00 is not heavily loaded in terms of resources:

  Resource   Requests              Limits
  --------   --------              ------
  cpu        22346m (17%)          23125m (18%)
  memory     195895318784 (76%)    20100M (7%)

# oc adm top node
NAME       CPU(cores)   CPU(%)   MEMORY(bytes)   MEMORY(%)
master-0   1335m        6%       11285Mi         23%
master-1   4938m        25%      16352Mi         33%
master-2   1367m        7%       13858Mi         28%
worker00   2358m        1%       148680Mi        60%
worker01   8426m        6%       154793Mi        63%
worker02   4399m        3%       153805Mi        62%

I tried rebooting the guest OS and the pod state did not change. Interestingly, a virtctl migrate test on one of the affected VMs worked, and afterwards the new pod was fine and remained in Running:

virt-density   virt-launcher-virt-density-313-ncmm7   3/3   Running     0   56s     10.128.2.238   worker02   <none>   1/1
virt-density   virt-launcher-virt-density-313-qtfn2   0/3   Completed   0   3h27m   10.131.0.132   worker00   <none>   1/1
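For reference, a sketch of the migration workaround described above (VM name taken from this run; assumes virtctl is installed and pointed at the cluster):

# virtctl migrate virt-density-313 -n virt-density
# oc get virtualmachineinstancemigrations -n virt-density
# oc get pod -n virt-density -o wide | grep virt-density-313

Once the migration completes, the new virt-launcher pod on the target node reports 3/3 Ready and the old pod goes to Completed, as shown in the output above.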