OpenShift Virtualization / CNV-36682

VMI deletion significantly slower on 4.15


    • Type: Bug
    • Priority: Major
    • Resolution: Obsolete
    • Versions: CNV v4.16.0, CNV v4.15.0
    • Component: CNV Virtualization

      Description of problem:

      We are seeing a surge of failing tests across all kubevirt lanes in which a VMI fails to be deleted within the 2-minute timeout (a timing sketch follows the links):
      https://main-jenkins-csb-cnvqe.apps.ocp-c1.prod.psi.redhat.com/job/test-kubevirt-cnv-4.15-network-ovn-ocs/172/testReport/(root)/Tests%20Suite/_sig_network__Services_Masquerade_interface_binding__without__a_service_matching_the_vmi_exposed_should_fail_to_reach_the_vmi/
      https://main-jenkins-csb-cnvqe.apps.ocp-c1.prod.psi.redhat.com/job/test-kubevirt-cnv-4.15-compute-ocs/191/testReport/(root)/Tests%20Suite/_Serial__sig_compute_Infrastructure_changes_to_the_kubernetes_client_on_the_virt_handler_rate_limiter_should_lead_to_delayed_VMI_running_states/
      https://main-jenkins-csb-cnvqe.apps.ocp-c1.prod.psi.redhat.com/job/test-kubevirt-cnv-4.15-storage-ocs/209/testReport/(root)/Tests%20Suite/_sig_storage__DataVolume_Integration__rfe_id_3188__crit_high__vendor_cnv_qe_redhat_com__level_system__Starting_a_VirtualMachineInstance_with_a_DataVolume_as_a_volume_source_Alpine_import__test_id_5252_should_be_successfully_started_when_using_a_PVC_volume_owned_by_a_DataVolume/
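      The symptom can be checked outside the test suite with a minimal sketch along the following lines (this is not part of the failing tests; the kubeconfig location and the namespace/name "default"/"vm-cirros-source" are assumptions). It deletes the VMI through the Kubernetes dynamic client and reports how long the object takes to disappear:

      package main

      import (
          "context"
          "fmt"
          "log"
          "time"

          k8serrors "k8s.io/apimachinery/pkg/api/errors"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/runtime/schema"
          "k8s.io/client-go/dynamic"
          "k8s.io/client-go/tools/clientcmd"
      )

      // GVR of the VirtualMachineInstance custom resource.
      var vmiGVR = schema.GroupVersionResource{
          Group:    "kubevirt.io",
          Version:  "v1",
          Resource: "virtualmachineinstances",
      }

      func main() {
          cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              log.Fatal(err)
          }
          client, err := dynamic.NewForConfig(cfg)
          if err != nil {
              log.Fatal(err)
          }

          ns, name := "default", "vm-cirros-source" // assumed namespace/name
          ctx := context.Background()
          vmis := client.Resource(vmiGVR).Namespace(ns)

          start := time.Now()
          if err := vmis.Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
              log.Fatal(err)
          }
          // Poll until the VMI object is gone; the failing tests give up after 2 minutes.
          for {
              _, err := vmis.Get(ctx, name, metav1.GetOptions{})
              if k8serrors.IsNotFound(err) {
                  break
              }
              if err != nil {
                  log.Fatal(err)
              }
              time.Sleep(2 * time.Second)
          }
          fmt.Printf("VMI %s/%s deleted after %s\n", ns, name, time.Since(start).Round(time.Second))
      }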

      Version-Release number of selected component (if applicable):

      CNV 4.15.0

      How reproducible:

      Very common on test lanes

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

      It's possible this is related to the new guest-console-log container.
      When such a VMI is stuck, the following errors are observable on the node:
      
      sh-5.1# journalctl --no-pager | grep virt-launcher-vm-cirros-source-qwpcf | grep err
      Dec 20 16:15:40 alex-rc0-w77cr-worker-0-z5pg2 kubenswrapper[3798]: E1220 16:15:40.865833    3798 kuberuntime_container.go:750] "Container termination failed with gracePeriod" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="default/virt-launcher-vm-cirros-source-qwpcf" podUID="09b06c37-7f7e-4a77-9378-e72fe8e0d8bc" containerName="guest-console-log" containerID="cri-o://f9930f6da1186a1be274a4d673ff630481f1b7bd5100cb40c93b6d3429c983d8" gracePeriod=30
      Dec 20 16:15:40 alex-rc0-w77cr-worker-0-z5pg2 kubenswrapper[3798]: E1220 16:15:40.865887    3798 kuberuntime_container.go:775] "Kill container failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="default/virt-launcher-vm-cirros-source-qwpcf" podUID="09b06c37-7f7e-4a77-9378-e72fe8e0d8bc" containerName="guest-console-log" containerID={"Type":"cri-o","ID":"f9930f6da1186a1be274a4d673ff630481f1b7bd5100cb40c93b6d3429c983d8"}
      Dec 20 16:15:41 alex-rc0-w77cr-worker-0-z5pg2 kubenswrapper[3798]: E1220 16:15:41.164844    3798 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"KillContainer\" for \"guest-console-log\" with KillContainerError: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"" pod="default/virt-launcher-vm-cirros-source-qwpcf" podUID="09b06c37-7f7e-4a77-9378-e72fe8e0d8bc"
      
      Is it possible that the guest-console-log container does not always handle termination signals gracefully?
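      For illustration only, a minimal sketch of what prompt termination handling looks like for a log-follower process (this is not the actual guest-console-log code, and the console log path is hypothetical): the process exits as soon as SIGTERM arrives, so the container stops within the grace period instead of hitting the DeadlineExceeded / force-kill errors shown above:

      package main

      import (
          "bufio"
          "context"
          "fmt"
          "os"
          "os/signal"
          "syscall"
      )

      func main() {
          // Cancel the context as soon as SIGTERM/SIGINT arrives.
          ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
          defer stop()

          // Hypothetical console log path; the real container follows the guest
          // serial console exposed by virt-launcher.
          f, err := os.Open("/var/run/kubevirt-private/console.log")
          if err != nil {
              fmt.Fprintln(os.Stderr, err)
              os.Exit(1)
          }
          defer f.Close()

          lines := make(chan string)
          go func() {
              scanner := bufio.NewScanner(f)
              for scanner.Scan() {
                  lines <- scanner.Text()
              }
              close(lines)
          }()

          for {
              select {
              case <-ctx.Done():
                  // Exit immediately on termination so the kubelet never has to
                  // wait out the 30s grace period and force-kill the container.
                  return
              case line, ok := <-lines:
                  if !ok {
                      return
                  }
                  fmt.Println(line)
              }
          }
      }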

        Attachments:
        1. ea2a721ff45ef.log.gz (4 kB, Simone Tiraboschi)
        2. kubelet.log.gz (3.05 MB, Simone Tiraboschi)

              People: Luboslav Pivarc, Alex Kalenyuk, Kedar Bidarkar
              Votes: 0
              Watchers: 11
