Type: Bug
Resolution: Done-Errata
Priority: Critical
Category: Quality / Stability / Reliability
Pull requests:
https://github.com/kiagnose/kubevirt-dpdk-checkup/pull/218
https://github.com/kiagnose/kubevirt-dpdk-checkup/pull/219
https://github.com/kiagnose/kubevirt-dpdk-checkup/pull/220
https://github.com/kiagnose/kubevirt-dpdk-checkup/pull/221
https://github.com/kiagnose/kubevirt-dpdk-checkup/pull/223
https://github.com/kiagnose/kubevirt-dpdk-checkup/pull/224
https://github.com/kiagnose/kubevirt-dpdk-checkup/pull/225
Description of problem:
The DPDK checkup occasionally fails with the following failure reason:

status.failureReason: 'failed to login to VMI "dpdk-checkup-ns/vmi-under-test-hgm62":
expect: timer expired after 60 seconds'

The suspicion is that this is a race introduced by the fix for https://issues.redhat.com/browse/CNV-35772: that fix requires rebooting the VM, which, as far as I understand, the job cannot always perform.
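As a manual cross-check (a hypothetical diagnostic step, not part of the checkup itself), one can try reaching the same console login path the checkup uses while the VMIs are still up; the VMI name below is taken from the failure message above and changes on every run:

   $ virtctl console -n dpdk-checkup-ns vmi-under-test-hgm62

If the console prompt does not appear within about a minute, that matches the 60-second expect timeout the checkup reports.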
Version-Release number of selected component (if applicable):
OCP/CNV: 4.15.0
DPDK checkup job: registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-kubevirt-dpdk-checkup-rhel9:v4.15.0-46
Traffic generator VMI: quay.io/kiagnose/kubevirt-dpdk-checkup-traffic-gen:v0.3.0
VMI under test: quay.io/kiagnose/kubevirt-dpdk-checkup-vm:v0.3.0
How reproducible:
~50%
Steps to Reproduce:
1. Create and switch to a new namespace for the job to run in:

   $ oc create ns dpdk-checkup-ns
   namespace/dpdk-checkup-ns created
   $ oc project dpdk-checkup-ns
   Now using project "dpdk-checkup-ns" on server "https://api.cnvcl3.lab.eng.tlv2.redhat.com:6443".

2. On a bare-metal cluster with SR-IOV and DPDK support (the CNV QE deployment jobs can deploy such a cluster), apply the attached `5-dpdk-checkup-resources.yaml`, which creates the Role, RoleBinding and ServiceAccount resources needed by the DPDK checkup:

   $ oc apply -f 5-dpdk-checkup-resources.yaml
   serviceaccount/dpdk-checkup-sa created
   role.rbac.authorization.k8s.io/kiagnose-configmap-access created
   rolebinding.rbac.authorization.k8s.io/kiagnose-configmap-access created
   role.rbac.authorization.k8s.io/kubevirt-dpdk-checker created
   rolebinding.rbac.authorization.k8s.io/kubevirt-dpdk-checker created

3. Apply the job's ConfigMap:

   $ cat << EOF | oc apply -f -
   > apiVersion: v1
   > kind: ConfigMap
   > metadata:
   >   name: dpdk-checkup-config
   > data:
   >   spec.timeout: 20m
   >   spec.param.testDuration: 120s
   >   spec.param.networkAttachmentDefinitionName: default/sriov-network
   >   spec.param.verbose: "true"
   >   spec.param.trafficGenPacketsPerSecond: 8m
   >   spec.param.trafficGenContainerDiskImage: "quay.io/kiagnose/kubevirt-dpdk-checkup-traffic-gen:v0.3.0"
   >   spec.param.vmUnderTestContainerDiskImage: "quay.io/kiagnose/kubevirt-dpdk-checkup-vm:v0.3.0"
   > EOF
   configmap/dpdk-checkup-config created

   Note: You might need to change `networkAttachmentDefinitionName` if you use another net-attach-def.

4. Run the job by applying the attached `7-dpdk-checkup-job.yaml` (a sketch of such a Job appears below) and wait for the job to complete:

   $ oc apply -f 7-dpdk-checkup-job.yaml
   job.batch/dpdk-checkup created
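For reference, a minimal Job along the lines of the attached `7-dpdk-checkup-job.yaml` might look as follows. This is a sketch based on the upstream kubevirt-dpdk-checkup examples, not the attached file itself; the image and ConfigMap references are assumed from the versions and steps above:

   # Sketch only; the attached 7-dpdk-checkup-job.yaml is authoritative.
   apiVersion: batch/v1
   kind: Job
   metadata:
     name: dpdk-checkup
   spec:
     backoffLimit: 0
     template:
       spec:
         serviceAccountName: dpdk-checkup-sa
         restartPolicy: Never
         containers:
           - name: dpdk-checkup
             # Image tag taken from the version info above.
             image: registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-kubevirt-dpdk-checkup-rhel9:v4.15.0-46
             env:
               # Points the checkup at the ConfigMap created in step 3.
               - name: CONFIGMAP_NAMESPACE
                 value: dpdk-checkup-ns
               - name: CONFIGMAP_NAME
                 value: dpdk-checkup-config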
Actual results:
Across several attempts, some end successfully and some fail, with the following failure reason in the output ConfigMap:
status.failureReason: 'failed to login to VMI "dpdk-checkup-ns/vmi-under-test-hgm62":
expect: timer expired after 60 seconds'
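Once the job finishes, the failure reason can be pulled straight from the output ConfigMap, for example (the dot inside the key is escaped in the jsonpath expression):

   $ oc get configmap dpdk-checkup-config -n dpdk-checkup-ns -o jsonpath='{.data.status\.failureReason}'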
Expected results:
All attempts should end successfully.
Additional info:
1. Out of 10 attempts I ran, 6 ended with this failure.
2. In the example above, the failure reason states there was a failure to log in to the VMI under test. This is the case in some of the failed runs, while in other failed attempts the same failure message appears for the traffic-gen VMI.
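3. A rough loop along these lines can be used to estimate the failure rate (a sketch only; `dpdk-checkup-config.yaml` is a hypothetical file holding the ConfigMap from step 3, which is recreated each round because the checkup writes its status into it):

   $ for i in $(seq 1 10); do
       oc delete job dpdk-checkup --ignore-not-found
       oc delete configmap dpdk-checkup-config --ignore-not-found
       oc apply -f dpdk-checkup-config.yaml   # hypothetical: ConfigMap from step 3
       oc apply -f 7-dpdk-checkup-job.yaml
       # A failed run never reaches condition=complete, so tolerate the timeout.
       oc wait --for=condition=complete --timeout=25m job/dpdk-checkup || true
       oc get configmap dpdk-checkup-config -o jsonpath='{.data.status\.failureReason}'; echo
     done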