Loading...

Type: Bug
Resolution: Obsolete
Priority: Major
Fix Version/s: CNV v4.16.0
Affects Version/s: CNV v4.15.0
Component/s: CNV Virtualization
Labels:
None

Activity Type:
Quality / Stability / Reliability
Story Points:
0.42
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Ready:
False
Market:

Severity:
Moderate

Regression:
No

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Description of problem:

We see a surge of failing tests across all kubevirt lanes where a VMI fails to delete within a timeout of 2 minutes:
https://main-jenkins-csb-cnvqe.apps.ocp-c1.prod.psi.redhat.com/job/test-kubevirt-cnv-4.15-network-ovn-ocs/172/testReport/(root)/Tests%20Suite/_sig_network__Services_Masquerade_interface_binding__without__a_service_matching_the_vmi_exposed_should_fail_to_reach_the_vmi/
https://main-jenkins-csb-cnvqe.apps.ocp-c1.prod.psi.redhat.com/job/test-kubevirt-cnv-4.15-compute-ocs/191/testReport/(root)/Tests%20Suite/_Serial__sig_compute_Infrastructure_changes_to_the_kubernetes_client_on_the_virt_handler_rate_limiter_should_lead_to_delayed_VMI_running_states/
https://main-jenkins-csb-cnvqe.apps.ocp-c1.prod.psi.redhat.com/job/test-kubevirt-cnv-4.15-storage-ocs/209/testReport/(root)/Tests%20Suite/_sig_storage__DataVolume_Integration__rfe_id_3188__crit_high__vendor_cnv_qe_redhat_com__level_system__Starting_a_VirtualMachineInstance_with_a_DataVolume_as_a_volume_source_Alpine_import__test_id_5252_should_be_successfully_started_when_using_a_PVC_volume_owned_by_a_DataVolume/

Version-Release number of selected component (if applicable):

CNV 4.15.0

How reproducible:

Very common on test lanes

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

Additional info:

It's possible this is related to the new guest console log container.
When such VMI is stuck, these logs are obserable on the node:

sh-5.1# journalctl --no-pager | grep virt-launcher-vm-cirros-source-qwpcf | grep err
Dec 20 16:15:40 alex-rc0-w77cr-worker-0-z5pg2 kubenswrapper[3798]: E1220 16:15:40.865833    3798 kuberuntime_container.go:750] "Container termination failed with gracePeriod" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="default/virt-launcher-vm-cirros-source-qwpcf" podUID="09b06c37-7f7e-4a77-9378-e72fe8e0d8bc" containerName="guest-console-log" containerID="cri-o://f9930f6da1186a1be274a4d673ff630481f1b7bd5100cb40c93b6d3429c983d8" gracePeriod=30
Dec 20 16:15:40 alex-rc0-w77cr-worker-0-z5pg2 kubenswrapper[3798]: E1220 16:15:40.865887    3798 kuberuntime_container.go:775] "Kill container failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" pod="default/virt-launcher-vm-cirros-source-qwpcf" podUID="09b06c37-7f7e-4a77-9378-e72fe8e0d8bc" containerName="guest-console-log" containerID={"Type":"cri-o","ID":"f9930f6da1186a1be274a4d673ff630481f1b7bd5100cb40c93b6d3429c983d8"}
Dec 20 16:15:41 alex-rc0-w77cr-worker-0-z5pg2 kubenswrapper[3798]: E1220 16:15:41.164844    3798 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"KillContainer\" for \"guest-console-log\" with KillContainerError: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\"" pod="default/virt-launcher-vm-cirros-source-qwpcf" podUID="09b06c37-7f7e-4a77-9378-e72fe8e0d8bc"

Is it possible the guest console log container is not reacting very nicely to termination signals at all times?