OpenShift Virtualization / CNV-66215
OCP virt 4.20 unable to create 200 VMs per host


      We (perf&scale) have a workload called virt-density that creates 200 VMs per host in a 6-worker environment. We have been running this workload daily since OCP 4.18.

      When we tried to onboard 4.20, we found that we were unable to complete the run.
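
      At this scale the run creates roughly 200 x 6 = 1200 VMs, i.e. 1200 virt-launcher pods of 3 containers each, on top of the usual infra pods. A quick way to confirm the per-node density while the test runs (a generic oc/awk sketch, not part of the workload tooling itself):

      # count virt-launcher pods per node; NODE is column 7 of `oc get pods -o wide`
      oc get pods -n virt-density -o wide --no-headers \
          | awk '{count[$7]++} END {for (n in count) print n, count[n]}' \
          | sort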

      A lot of the virt-launcher pods have problems:

      [root@m42-h01-000-r760 ~]# oc get po -n virt-density |tail
      virt-launcher-virt-density-90-tpzz9 0/3 Error 0 173m
      virt-launcher-virt-density-91-nv6w2 3/3 Running 0 44m
      virt-launcher-virt-density-92-5q8d5 0/3 Init:ImageInspectError 0 15m
      virt-launcher-virt-density-93-7zg7b 0/3 Init:ImageInspectError 0 15m
      virt-launcher-virt-density-94-vr65v 0/3 Init:CreateContainerError 0 33m
      virt-launcher-virt-density-95-lxssd 3/3 Running 0 42m
      virt-launcher-virt-density-96-qs7pj 3/3 Running 1 3h9m
      virt-launcher-virt-density-97-flhtg 3/3 Running 0 4h16m
      virt-launcher-virt-density-98-bt4n5 0/3 Init:ImageInspectError 0 15m
      virt-launcher-virt-density-99-bfknn 1/3 Init:ImageInspectError 0 31m
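
      To quantify how widespread this is, tallying the pod states across the whole namespace helps (again a generic sketch, not captured output):

      # tally virt-launcher pod states; STATUS is column 3
      oc get pods -n virt-density --no-headers | awk '{print $3}' | sort | uniq -c | sort -rn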

      Looking into one of the pods:

      [root@m42-h01-000-r760 ~]# oc describe po virt-launcher-virt-density-92-5q8d5 | tail
      Events:
      Type Reason Age From Message
      ---- ------ ---- ---- -------
      Normal Scheduled 15m default-scheduler Successfully assigned virt-density/virt-launcher-virt-density-92-5q8d5 to m42-h09-000-r760
      Normal AddedInterface 15m multus Add eth0 [10.129.2.253/23] from ovn-kubernetes
      Warning Failed 11m (x2 over 13m) kubelet Error: context deadline exceeded
      Normal Pulled 10m (x3 over 15m) kubelet Container image "quay.io/openshift-cnv/container-native-virtualization-virt-launcher-rhel9@sha256:8f99b9bcab79ae7d5fe92f17efc700778fb195fe8c61341c56ac344545a396dc" already present on machine
      Warning Failed 8m33s kubelet Error: stream terminated by RST_STREAM with error code: CANCEL
      Warning InspectFailed 31s (x4 over 6m32s) kubelet Failed to inspect image "": rpc error: code = DeadlineExceeded desc = context deadline exceeded
      Warning Failed 31s (x4 over 6m32s) kubelet Error: ImageInspectError
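
      Both ImageInspectError and the RST_STREAM / context-deadline messages come from the kubelet's CRI calls into CRI-O timing out; note the image itself is reported as already present on the machine, so this looks like runtime overload rather than a registry or image problem. Diagnostics we would try next (a sketch; node name taken from the event above, assumes cluster-admin):

      # look for runtime-side timeouts in the CRI-O journal on the affected node
      oc adm node-logs m42-h09-000-r760 -u crio | grep -iE 'deadline|timeout' | tail -n 20
      # count how many containers CRI-O is tracking on that node
      oc debug node/m42-h09-000-r760 -- chroot /host crictl ps -a | wc -l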

      The CNV components seem to be failing during and after the test (they are healthy before):

      oc get pods | grep -v Run
      NAME READY STATUS RESTARTS AGE
      aaq-operator-744b9d7bf6-sb95f 0/1 CrashLoopBackOff 21 (3m43s ago) 53m
      cdi-apiserver-689cdf75f5-dg5m4 0/1 CrashLoopBackOff 12 (2m43s ago) 45m
      cdi-deployment-8474657b56-k5w2g 0/1 CrashLoopBackOff 12 (2m59s ago) 45m
      cluster-network-addons-operator-97556c5c9-7scfz 2/2 Terminating 0 7h32m
      hco-operator-7ccd979ccc-x6xts 0/1 CrashLoopBackOff 14 (5m7s ago) 53m
      hco-webhook-5fdcfd7b78-pb84r 0/1 CrashLoopBackOff 15 (3m51s ago) 53m
      hostpath-provisioner-operator-788c9697d6-6grwc 0/1 CrashLoopBackOff 21 (109s ago) 53m
      kubemacpool-cert-manager-6f94bfbfbd-mkrbm 1/1 Terminating 0 7h31m
      kubevirt-console-plugin-56bf7bd6fb-bwqh4 1/1 Terminating 0 7h30m
      kubevirt-ipam-controller-manager-7d6978d694-8dq57 1/1 Terminating 0 7h31m
      ssp-operator-589f9f4576-sk2cl 0/1 CrashLoopBackOff 16 (3m40s ago) 53m
      ssp-operator-6874fc8966-5tl4d 1/1 Terminating 1 (7h31m ago) 7h32m
      virt-handler-khxrp 0/1 CrashLoopBackOff 25 (5m7s ago) 7h30m
      virt-operator-74d9df7468-wj4pj 0/1 CrashLoopBackOff 19 (4m57s ago) 53m
      virt-operator-d687d8bcb-wq2g5 0/1 Terminating 0 3h20m
      virt-template-validator-748c84d6dc-zv2s9 1/1 Terminating 0 7h30m
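
      Not captured here, but the recent events in openshift-cnv should show why the operators are crash-looping (generic check):

      oc get events -n openshift-cnv --sort-by=.lastTimestamp | tail -n 20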

      Looking into one of the pods:

      # oc describe po virt-handler-khxrp
      ...
      Events:
      Type Reason Age From Message
      ---- ------ ---- ---- -------
      Warning ProbeError 172m (x18 over 15h) kubelet Liveness probe error: Get "https://10.128.2.16:8443/healthz": dial tcp 10.128.2.16:8443: connect: connection refused
      body:
      Warning ProbeError 148m (x95 over 17h) kubelet Readiness probe error: Get "https://10.128.2.16:8443/healthz": EOF
      body:
      Warning ProbeError 138m (x210 over 17h) kubelet Liveness probe error: Get "https://10.128.2.16:8443/healthz": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
      body:
      Warning ProbeError 93m (x251 over 17h) kubelet Liveness probe error: Get "https://10.128.2.16:8443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
      body:
      Warning ProbeError 48m (x174 over 17h) kubelet Readiness probe error: Get "https://10.128.2.16:8443/healthz": dial tcp 10.128.2.16:8443: connect: connection refused
      body:
      Normal Created 40m (x173 over 21h) kubelet Created container: virt-handler
      Warning ProbeError 23m (x684 over 17h) kubelet Readiness probe error: Get "https://10.128.2.16:8443/healthz": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
      body:
      Warning BackOff 8m14s (x2156 over 15h) kubelet Back-off restarting failed container virt-handler in pod virt-handler-khxrp_openshift-cnv(65b84f8b-e952-4a41-b73f-4d05f0d07bb0)
      Normal Pulled 6m34s (x178 over 17h) kubelet Container image "quay.io/openshift-cnv/container-native-virtualization-virt-handler-rhel9@sha256:7a09454ddbcaef244b4ce67f82b67d26e602e1577cef8f8f1e2086bf82429a3a" already present on machine
      Normal Killing 4m39s (x179 over 17h) kubelet Container virt-handler failed liveness probe, will be restarted
      Warning ProbeError 3m17s (x717 over 17h) kubelet Readiness probe error: Get "https://10.128.2.16:8443/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
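
      All of these probe failures are the kubelet timing out or being refused on virt-handler's /healthz endpoint, i.e. the daemon is too starved to answer its own health check. To confirm what the container saw before each restart (standard oc calls, output not captured here):

      # last lines from the previous (crashed) virt-handler container
      oc logs -n openshift-cnv virt-handler-khxrp -c virt-handler --previous | tail -n 50
      # why the last instance terminated (exit code, OOMKilled vs Error)
      oc get pod -n openshift-cnv virt-handler-khxrp \
          -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'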
      

       

      Some additional context in case it's helpful:
