OpenShift Virtualization / CNV-29868

[2214838] Failed to restart VMI in cnv - Failed to terminate process Device or resource busy

    • Critical
    • None

      Description of problem:

      Restarting a VMI that uses OCS storage and hot-plugged disks leaves the virt-launcher pod stuck in 'Terminating' indefinitely, and virt-launcher logs "Failed to terminate process 68 with SIGTERM: Device or resource busy".

      The pod cannot be killed and restarted, so the VM stays down.

      Version-Release number of selected component (if applicable):

      How reproducible:

      Not reliably reproducible; the failure appears to be intermittent.

      Steps to Reproduce:

      1. Reboot the worker node to clear previously hung pods.

      2. Restart the VMI multiple times.

      Actual results:

      After restarting VMI:

      1. oc get events:
        1h5m Normal Killing pod/hp-volume-kzn4h Stopping container hotplug-disk
        5m56s Normal Killing pod/virt-launcher-awrhdv500-qhs86 Stopping container compute
        5m56s Warning FailedKillPod pod/virt-launcher-awrhdv500-qhs86 error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "19db6081-760e-42f4-859e-fe2b79239275" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
        13m Warning FailedKillPod pod/virt-launcher-awrhdv500-qhs86 error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container 21e9a094da73c02fbf5e6c4c21b5892b7ae6a161e9ac79a358ebe1040fec826a: context deadline exceeded", failed to "KillPodSandbox" for "19db6081-760e-42f4-859e-fe2b79239275" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container for pod sandbox 3ca982adc188e0fd08c40f49a9d2593b476d66d6084142a0b24fa6de119df262: failed to stop container k8s_compute_virt-launcher-awrhdv500-qhs86_XXXX-os-images_19db6081-760e-42f4-859e-fe2b79239275_0: context deadline exceeded"]
        20m Warning FailedKillPod pod/virt-launcher-awrhdv500-qhs86 error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container 21e9a094da73c02fbf5e6c4c21b5892b7ae6a161e9ac79a358ebe1040fec826a: context deadline exceeded", failed to "KillPodSandbox" for "19db6081-760e-42f4-859e-fe2b79239275" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
      2. omg logs virt-launcher-awrhdv500-qhs86
        2023-06-13T16:35:31.597516920Z {"component":"virt-launcher","kind":"","level":"info","msg":"Signaled vmi shutdown","name":"awrhdv500","namespace":"XXX-os-images","pos":"server.go:311","timestamp":"2023-06-13T16:35:31.597448Z","uid":"42e57a49-2ade-4a26-ba02-b4d4adeb43bc"}

        2023-06-13T16:35:47.026189598Z {"component":"virt-launcher","level":"error","msg":"Failed to terminate process 68 with SIGTERM: Device or resource busy","pos":"virProcessKillPainfullyDelay:472","subcomponent":"libvirt","thread":"26","timestamp":"2023-06-13T16:35:47.025000Z"}

        2023-06-13T16:35:47.030654658Z {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 6 with reason 2 received","pos":"client.go:435","timestamp":"2023-06-13T16:35:47.030612Z"}

        2023-06-13T16:35:48.786034568Z {"component":"virt-launcher","level":"info","msg":"Grace Period expired, shutting down.","pos":"monitor.go:165","timestamp":"2023-06-13T16:35:48.785937Z"}
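      For triage, the virt-launcher output above can be reduced to its error entries and to the time between the shutdown signal and grace-period expiry. The following is only a rough Python sketch: the helper names (parse_launcher_log, entry_time) are hypothetical and not part of any KubeVirt tooling, and it assumes each entry is a JSON object that may or may not be preceded by a container-runtime timestamp on the same line.

```python
import json
from datetime import datetime

def parse_launcher_log(text):
    """Yield one dict per JSON log entry found in virt-launcher output.

    Hypothetical helper: tolerates an optional leading timestamp and
    skips lines that carry no JSON object (blank or timestamp-only lines).
    """
    for line in text.splitlines():
        start = line.find("{")
        if start == -1:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a complete JSON entry; ignore it
        if isinstance(entry, dict):
            yield entry

def entry_time(entry):
    """Parse the entry's own "timestamp" field (UTC, microsecond precision)."""
    return datetime.strptime(entry["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")

# Three entries taken from the log excerpt in this report.
SAMPLE = r'''
2023-06-13T16:35:31.597516920Z {"component":"virt-launcher","kind":"","level":"info","msg":"Signaled vmi shutdown","name":"awrhdv500","namespace":"XXX-os-images","pos":"server.go:311","timestamp":"2023-06-13T16:35:31.597448Z","uid":"42e57a49-2ade-4a26-ba02-b4d4adeb43bc"}
2023-06-13T16:35:47.026189598Z
{"component":"virt-launcher","level":"error","msg":"Failed to terminate process 68 with SIGTERM: Device or resource busy","pos":"virProcessKillPainfullyDelay:472","subcomponent":"libvirt","thread":"26","timestamp":"2023-06-13T16:35:47.025000Z"}
2023-06-13T16:35:48.786034568Z
{"component":"virt-launcher","level":"info","msg":"Grace Period expired, shutting down.","pos":"monitor.go:165","timestamp":"2023-06-13T16:35:48.785937Z"}
'''

entries = list(parse_launcher_log(SAMPLE))
errors = [e for e in entries if e.get("level") == "error"]
for e in errors:
    print(e["timestamp"], e.get("subcomponent", "-"), e["msg"])

# Seconds between the first and last entries in the sample
# ("Signaled vmi shutdown" and "Grace Period expired, shutting down.").
elapsed = (entry_time(entries[-1]) - entry_time(entries[0])).total_seconds()
print(f"shutdown signal to grace period expiry: {elapsed:.3f}s")
```

      On the sample, this prints the single libvirt error and an interval of roughly 17 seconds between the shutdown signal and the grace period expiring.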

      Expected results:

      VM restarts cleanly.

      Additional info:

      This is the second time this has happened in a week. On the previous occurrence we logged in to the node and found a defunct qemu-kvm process, with the 'conmon' process for the pod still running:

      sh-4.4# ps -ef | grep -i awrhdv500
      root 1297341 1292627 0 18:40 ? 00:00:00 grep -i awrhdv500
      root 3588286 1 0 Jun08 ? 00:00:00 /usr/bin/conmon -b /run/containers/storage/overlay-containers/d129e13673e5d2280e3a931f07c2f58e64d880da4ef615274e09553576a3a1c2/userdata -c d129e13673e5d2280e3a931f07c2f58e64d880da4ef615274e09553576a3a1c2 --exit-dir /var/run/crio/exits -l /var/log/pods/XXX-os-images_virt-launcher-awrhdv500-kvkf8_57ab818f-aea7-4a5c-ad20-46fdc5e547ee/compute/
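      When checking a node for leftovers like this, it can help to list every conmon process together with the container ID it was started for, so the IDs can be compared against the containers that still exist. The following Python sketch is only an illustration: conmon_processes is a hypothetical helper, and it assumes the `ps -ef` column layout shown above and that the container ID follows the `-c` flag, as it does in the output in this report.

```python
def conmon_processes(ps_output):
    """Return (pid, ppid, container_id) for each conmon line in `ps -ef` output.

    Hypothetical helper; container_id is None when no `-c` argument is found.
    """
    found = []
    for line in ps_output.splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD keeps its spaces).
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header, blank, or malformed line
        argv = fields[7].split()
        if not argv or not argv[0].endswith("/conmon"):
            continue  # some other process, e.g. the grep itself
        container_id = None
        if "-c" in argv[:-1]:
            container_id = argv[argv.index("-c") + 1]
        found.append((int(fields[1]), int(fields[2]), container_id))
    return found

# The two lines of ps output quoted in this report, unchanged.
PS_SAMPLE = """\
root 1297341 1292627 0 18:40 ? 00:00:00 grep -i awrhdv500
root 3588286 1 0 Jun08 ? 00:00:00 /usr/bin/conmon -b /run/containers/storage/overlay-containers/d129e13673e5d2280e3a931f07c2f58e64d880da4ef615274e09553576a3a1c2/userdata -c d129e13673e5d2280e3a931f07c2f58e64d880da4ef615274e09553576a3a1c2 --exit-dir /var/run/crio/exits -l /var/log/pods/XXX-os-images_virt-launcher-awrhdv500-kvkf8_57ab818f-aea7-4a5c-ad20-46fdc5e547ee/compute/
"""

for pid, ppid, container_id in conmon_processes(PS_SAMPLE):
    print(pid, ppid, container_id)
```

      A parent PID of 1 is not by itself a sign of a problem; the useful signal is a conmon whose container ID no longer corresponds to any existing pod.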

      We force-deleted the pod virt-launcher-awrhdv500-kvkf8. The pod object was removed, but the /usr/bin/conmon -b /run/containers/storage/overlay-containers/ process was not cleaned up and kept running on the node.

      We then tried to start the VM awrhdv500, and it got stuck in the "Pending" state.

      The only way we found to clear it was to reboot the worker node.

              alitke@redhat.com Adam Litke
              shaselde@redhat.com Sean Haselden