Type: Bug
Resolution: Done-Errata
Priority: Critical
Category: Quality / Stability / Reliability
Pull requests:
https://github.com/kiagnose/kubevirt-dpdk-checkup/pull/218
https://github.com/kiagnose/kubevirt-dpdk-checkup/pull/219
https://github.com/kiagnose/kubevirt-dpdk-checkup/pull/220
https://github.com/kiagnose/kubevirt-dpdk-checkup/pull/221
https://github.com/kiagnose/kubevirt-dpdk-checkup/pull/223
https://github.com/kiagnose/kubevirt-dpdk-checkup/pull/224
https://github.com/kiagnose/kubevirt-dpdk-checkup/pull/225
Description of problem:
The DPDK checkup occasionally fails with the following failure reason:

status.failureReason: 'failed to login to VMI "dpdk-checkup-ns/vmi-under-test-hgm62":
expect: timer expired after 60 seconds'

The suspicion is that this is a race introduced by the fix for https://issues.redhat.com/browse/CNV-35772: that fix requires rebooting the VM, which, as far as I understand, the job cannot always perform.
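As a manual cross-check (a hypothetical diagnostic step, not part of the checkup itself), one can try reaching the same console login path the checkup uses while the VMIs are still up; the VMI name below is taken from the failure message above and changes on every run:

   $ virtctl console -n dpdk-checkup-ns vmi-under-test-hgm62

If the console prompt does not appear within about a minute, that matches the 60-second expect timeout the checkup reports.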
Version-Release number of selected component (if applicable):
OCP/CNV: 4.15.0
DPDK checkup job: registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-kubevirt-dpdk-checkup-rhel9:v4.15.0-46
Traffic generator VMI: quay.io/kiagnose/kubevirt-dpdk-checkup-traffic-gen:v0.3.0
VMI under test: quay.io/kiagnose/kubevirt-dpdk-checkup-vm:v0.3.0
How reproducible:
~50%
Steps to Reproduce:
1. Create and switch to a new namespace for the job to run in:

   $ oc create ns dpdk-checkup-ns
   namespace/dpdk-checkup-ns created
   $ oc project dpdk-checkup-ns
   Now using project "dpdk-checkup-ns" on server "https://api.cnvcl3.lab.eng.tlv2.redhat.com:6443".

2. On a bare-metal cluster with SR-IOV and DPDK support (the CNV QE deployment jobs can deploy such a cluster), apply the attached `5-dpdk-checkup-resources.yaml`, which creates the Role, RoleBinding and ServiceAccount resources needed by the DPDK checkup:

   $ oc apply -f 5-dpdk-checkup-resources.yaml
   serviceaccount/dpdk-checkup-sa created
   role.rbac.authorization.k8s.io/kiagnose-configmap-access created
   rolebinding.rbac.authorization.k8s.io/kiagnose-configmap-access created
   role.rbac.authorization.k8s.io/kubevirt-dpdk-checker created
   rolebinding.rbac.authorization.k8s.io/kubevirt-dpdk-checker created

3. Apply the job's ConfigMap:

   $ cat << EOF | oc apply -f -
   > apiVersion: v1
   > kind: ConfigMap
   > metadata:
   >   name: dpdk-checkup-config
   > data:
   >   spec.timeout: 20m
   >   spec.param.testDuration: 120s
   >   spec.param.networkAttachmentDefinitionName: default/sriov-network
   >   spec.param.verbose: "true"
   >   spec.param.trafficGenPacketsPerSecond: 8m
   >   spec.param.trafficGenContainerDiskImage: "quay.io/kiagnose/kubevirt-dpdk-checkup-traffic-gen:v0.3.0"
   >   spec.param.vmUnderTestContainerDiskImage: "quay.io/kiagnose/kubevirt-dpdk-checkup-vm:v0.3.0"
   > EOF
   configmap/dpdk-checkup-config created

   Note: You might need to change `networkAttachmentDefinitionName` if you use another net-attach-def.

4. Run the job by applying the attached `7-dpdk-checkup-job.yaml` (a sketch of such a Job appears below) and wait for the job to complete:

   $ oc apply -f 7-dpdk-checkup-job.yaml
   job.batch/dpdk-checkup created
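For reference, a minimal Job along the lines of the attached `7-dpdk-checkup-job.yaml` might look as follows. This is a sketch based on the upstream kubevirt-dpdk-checkup examples, not the attached file itself; the image and ConfigMap references are assumed from the versions and steps above:

   # Sketch only; the attached 7-dpdk-checkup-job.yaml is authoritative.
   apiVersion: batch/v1
   kind: Job
   metadata:
     name: dpdk-checkup
   spec:
     backoffLimit: 0
     template:
       spec:
         serviceAccountName: dpdk-checkup-sa
         restartPolicy: Never
         containers:
           - name: dpdk-checkup
             # Image tag taken from the version info above.
             image: registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-kubevirt-dpdk-checkup-rhel9:v4.15.0-46
             env:
               # Points the checkup at the ConfigMap created in step 3.
               - name: CONFIGMAP_NAMESPACE
                 value: dpdk-checkup-ns
               - name: CONFIGMAP_NAME
                 value: dpdk-checkup-config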
Actual results:
Across several attempts, some end successfully and some fail, with the following failure reason in the output ConfigMap:
status.failureReason: 'failed to login to VMI "dpdk-checkup-ns/vmi-under-test-hgm62":
expect: timer expired after 60 seconds'
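Once the job finishes, the failure reason can be pulled straight from the output ConfigMap, for example (the dot inside the key is escaped in the jsonpath expression):

   $ oc get configmap dpdk-checkup-config -n dpdk-checkup-ns -o jsonpath='{.data.status\.failureReason}'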
Expected results:
All attempts should end successfully.
Additional info:
1. Out of 10 attempts I ran, 6 ended with this failure.
2. In the example above, the failure reason states there was a failure to log in to the VMI under test. This is the case in some of the failed runs, while in other failed attempts the same failure message appears for the traffic-gen VMI.
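3. A rough loop along these lines can be used to estimate the failure rate (a sketch only; `dpdk-checkup-config.yaml` is a hypothetical file holding the ConfigMap from step 3, which is recreated each round because the checkup writes its status into it):

   $ for i in $(seq 1 10); do
       oc delete job dpdk-checkup --ignore-not-found
       oc delete configmap dpdk-checkup-config --ignore-not-found
       oc apply -f dpdk-checkup-config.yaml   # hypothetical: ConfigMap from step 3
       oc apply -f 7-dpdk-checkup-job.yaml
       # A failed run never reaches condition=complete, so tolerate the timeout.
       oc wait --for=condition=complete --timeout=25m job/dpdk-checkup || true
       oc get configmap dpdk-checkup-config -o jsonpath='{.data.status\.failureReason}'; echo
     done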