- Bug
- Resolution: Done
- Critical
- None
- CNV v4.20.0
- Quality / Stability / Reliability
- 0.42
- False
- False
- Critical
- Yes
We (perf&scale) have a workload called virt-density that creates 200 VMs per host in a 6-worker environment. We have been running this workload daily since OCP 4.18.
When we tried to onboard 4.20, we found that we were unable to complete the run.
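For context, the workload essentially bulk-creates minimal VMs. A hypothetical sketch of what that boils down to (the VM shape and container disk image here are illustrative only; the real implementation is in the step-registry links at the end of this description):

# Illustrative sketch only -- not the actual virt-density implementation.
for i in $(seq 1 200); do
  oc apply -n virt-density -f - <<EOF
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: virt-density-${i}
spec:
  runStrategy: Always
  template:
    spec:
      domain:
        resources:
          requests:
            memory: 128Mi
        devices:
          disks:
          - name: containerdisk
            disk:
              bus: virtio
      volumes:
      - name: containerdisk
        containerDisk:
          image: quay.io/kubevirt/cirros-container-disk-demo
EOF
done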
A lot of the virt-launcher pods have problems:
[root@m42-h01-000-r760 ~]# oc get po -n virt-density | tail
virt-launcher-virt-density-90-tpzz9   0/3   Error                       0   173m
virt-launcher-virt-density-91-nv6w2   3/3   Running                     0   44m
virt-launcher-virt-density-92-5q8d5   0/3   Init:ImageInspectError      0   15m
virt-launcher-virt-density-93-7zg7b   0/3   Init:ImageInspectError      0   15m
virt-launcher-virt-density-94-vr65v   0/3   Init:CreateContainerError   0   33m
virt-launcher-virt-density-95-lxssd   3/3   Running                     0   42m
virt-launcher-virt-density-96-qs7pj   3/3   Running                     1   3h9m
virt-launcher-virt-density-97-flhtg   3/3   Running                     0   4h16m
virt-launcher-virt-density-98-bt4n5   0/3   Init:ImageInspectError      0   15m
virt-launcher-virt-density-99-bfknn   1/3   Init:ImageInspectError      0   31m
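To quantify how widespread the failures are, a one-liner like the following (a suggested convenience, not part of the original triage) tallies the pod states across the namespace:

# Count pods per status in the virt-density namespace
oc get po -n virt-density --no-headers | awk '{print $3}' | sort | uniq -c | sort -rn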
Looking into one of the pods:
[root@m42-h01-000-r760 ~]# oc describe po virt-launcher-virt-density-92-5q8d5 | tail
Events:
  Type     Reason          Age                  From               Message
  ----     ------          ----                 ----               -------
  Normal   Scheduled       15m                  default-scheduler  Successfully assigned virt-density/virt-launcher-virt-density-92-5q8d5 to m42-h09-000-r760
  Normal   AddedInterface  15m                  multus             Add eth0 [10.129.2.253/23] from ovn-kubernetes
  Warning  Failed          11m (x2 over 13m)    kubelet            Error: context deadline exceeded
  Normal   Pulled          10m (x3 over 15m)    kubelet            Container image "quay.io/openshift-cnv/container-native-virtualization-virt-launcher-rhel9@sha256:8f99b9bcab79ae7d5fe92f17efc700778fb195fe8c61341c56ac344545a396dc" already present on machine
  Warning  Failed          8m33s                kubelet            Error: stream terminated by RST_STREAM with error code: CANCEL
  Warning  InspectFailed   31s (x4 over 6m32s)  kubelet            Failed to inspect image "": rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  Failed          31s (x4 over 6m32s)  kubelet            Error: ImageInspectError
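The InspectFailed events are kubelet-to-CRI-O gRPC calls timing out, which points at the container runtime on the node being saturated rather than the image being missing. One way to confirm from the node itself (a sketch; assumes cluster-admin access, with the node name taken from the Scheduled event above):

# Look for runtime-side timeouts in the CRI-O journal on the affected node
oc debug node/m42-h09-000-r760 -- chroot /host journalctl -u crio --since "1 hour ago" --no-pager | grep -iE 'deadline|timeout' | tail -n 20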
The CNV components themselves seem to be failing during and after the test (they are healthy before it starts):
oc get pods | grep -v Run
NAME                                                READY   STATUS             RESTARTS         AGE
aaq-operator-744b9d7bf6-sb95f                       0/1     CrashLoopBackOff   21 (3m43s ago)   53m
cdi-apiserver-689cdf75f5-dg5m4                      0/1     CrashLoopBackOff   12 (2m43s ago)   45m
cdi-deployment-8474657b56-k5w2g                     0/1     CrashLoopBackOff   12 (2m59s ago)   45m
cluster-network-addons-operator-97556c5c9-7scfz     2/2     Terminating        0                7h32m
hco-operator-7ccd979ccc-x6xts                       0/1     CrashLoopBackOff   14 (5m7s ago)    53m
hco-webhook-5fdcfd7b78-pb84r                        0/1     CrashLoopBackOff   15 (3m51s ago)   53m
hostpath-provisioner-operator-788c9697d6-6grwc      0/1     CrashLoopBackOff   21 (109s ago)    53m
kubemacpool-cert-manager-6f94bfbfbd-mkrbm           1/1     Terminating        0                7h31m
kubevirt-console-plugin-56bf7bd6fb-bwqh4            1/1     Terminating        0                7h30m
kubevirt-ipam-controller-manager-7d6978d694-8dq57   1/1     Terminating        0                7h31m
ssp-operator-589f9f4576-sk2cl                       0/1     CrashLoopBackOff   16 (3m40s ago)   53m
ssp-operator-6874fc8966-5tl4d                       1/1     Terminating        1 (7h31m ago)    7h32m
virt-handler-khxrp                                  0/1     CrashLoopBackOff   25 (5m7s ago)    7h30m
virt-operator-74d9df7468-wj4pj                      0/1     CrashLoopBackOff   19 (4m57s ago)   53m
virt-operator-d687d8bcb-wq2g5                       0/1     Terminating        0                3h20m
virt-template-validator-748c84d6dc-zv2s9            1/1     Terminating        0                7h30m
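The HyperConverged operator's own view of the degradation may also be useful (a sketch, assuming the default CR name kubevirt-hyperconverged in openshift-cnv):

# Dump the HCO status conditions: type, status, message per line
oc get hyperconverged -n openshift-cnv kubevirt-hyperconverged -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'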
Looking into the virt-handler pod:
# oc describe po virt-handler-khxrp
...
Events:
  Type     Reason      Age                     From     Message
  ----     ------      ----                    ----     -------
  Warning  ProbeError  172m (x18 over 15h)     kubelet  Liveness probe error: Get "https://10.128.2.16:8443/healthz": dial tcp 10.128.2.16:8443: connect: connection refused body:
  Warning  ProbeError  148m (x95 over 17h)     kubelet  Readiness probe error: Get "https://10.128.2.16:8443/healthz": EOF body:
  Warning  ProbeError  138m (x210 over 17h)    kubelet  Liveness probe error: Get "https://10.128.2.16:8443/healthz": net/http: request canceled (Client.Timeout exceeded while awaiting headers) body:
  Warning  ProbeError  93m (x251 over 17h)     kubelet  Liveness probe error: Get "https://10.128.2.16:8443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers) body:
  Warning  ProbeError  48m (x174 over 17h)     kubelet  Readiness probe error: Get "https://10.128.2.16:8443/healthz": dial tcp 10.128.2.16:8443: connect: connection refused body:
  Normal   Created     40m (x173 over 21h)     kubelet  Created container: virt-handler
  Warning  ProbeError  23m (x684 over 17h)     kubelet  Readiness probe error: Get "https://10.128.2.16:8443/healthz": net/http: request canceled (Client.Timeout exceeded while awaiting headers) body:
  Warning  BackOff     8m14s (x2156 over 15h)  kubelet  Back-off restarting failed container virt-handler in pod virt-handler-khxrp_openshift-cnv(65b84f8b-e952-4a41-b73f-4d05f0d07bb0)
  Normal   Pulled      6m34s (x178 over 17h)   kubelet  Container image "quay.io/openshift-cnv/container-native-virtualization-virt-handler-rhel9@sha256:7a09454ddbcaef244b4ce67f82b67d26e602e1577cef8f8f1e2086bf82429a3a" already present on machine
  Normal   Killing     4m39s (x179 over 17h)   kubelet  Container virt-handler failed liveness probe, will be restarted
  Warning  ProbeError  3m17s (x717 over 17h)   kubelet  Readiness probe error: Get "https://10.128.2.16:8443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
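Since the liveness probe keeps restarting the container, the logs of the previous instance are the interesting ones, e.g.:

# Fetch the tail of the crashed (previous) virt-handler container's log
oc logs -n openshift-cnv virt-handler-khxrp -c virt-handler --previous --tail=50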
Some additional context in case it's helpful:
- How we deploy the CNV operator: https://github.com/openshift/release/blob/master/ci-operator/step-registry/openshift-qe/installer/bm/day2/cnv/
- How we run the test: https://github.com/openshift/release/tree/master/ci-operator/step-registry/openshift-qe/virt-density
- is triggering: CNV-63338 [vme-perf] VM live migration to a specific node (New)