  1. OpenShift Virtualization
  2. CNV-37421

[DPDK checkup] Job occasionally fails due to VMI login failure


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Critical
    • CNV Network

      Description of problem:

      DPDK checkup occasionally fails with the following failure reason:
        status.failureReason: 'failed to login to VMI "dpdk-checkup-ns/vmi-under-test-hgm62":
          expect: timer expired after 60 seconds'
      
      The suspected cause is a race introduced by the fix for https://issues.redhat.com/browse/CNV-35772. That fix requires a reboot of the VM, which, IIUC, the job cannot always perform.
      

      Version-Release number of selected component (if applicable):

      OCP/CNV 4.15.0
      DPDK checkup job: registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-kubevirt-dpdk-checkup-rhel9:v4.15.0-46
      Traffic generator VMI: quay.io/kiagnose/kubevirt-dpdk-checkup-traffic-gen:v0.3.0
      VMI under test: quay.io/kiagnose/kubevirt-dpdk-checkup-vm:v0.3.0
      

      How reproducible:

      ~50%
       

      Steps to Reproduce:

      1.
      Create and switch to a new namespace for the job to run in:
      
      $ oc create ns dpdk-checkup-ns
      namespace/dpdk-checkup-ns created
      $ oc project dpdk-checkup-ns 
      Now using project "dpdk-checkup-ns" on server "https://api.cnvcl3.lab.eng.tlv2.redhat.com:6443".
      
      2.
      On a bare-metal cluster with SR-IOV and DPDK support (CNV QE deployment jobs can deploy such a cluster), apply the attached `5-dpdk-checkup-resources.yaml`, which creates the Role, RoleBinding and ServiceAccount resources needed for the DPDK checkup.
      
      $ oc apply -f 5-dpdk-checkup-resources.yaml 
      serviceaccount/dpdk-checkup-sa created
      role.rbac.authorization.k8s.io/kiagnose-configmap-access created
      rolebinding.rbac.authorization.k8s.io/kiagnose-configmap-access created
      role.rbac.authorization.k8s.io/kubevirt-dpdk-checker created
      rolebinding.rbac.authorization.k8s.io/kubevirt-dpdk-checker created
      
      3.
      Apply the job's ConfigMap:
      
      $ cat << EOF | oc apply -f -
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: dpdk-checkup-config
      data:
        spec.timeout: 20m
        spec.param.testDuration: 120s
        spec.param.networkAttachmentDefinitionName: default/sriov-network
        spec.param.verbose: "true"
        spec.param.trafficGenPacketsPerSecond: 8m
        spec.param.trafficGenContainerDiskImage: "quay.io/kiagnose/kubevirt-dpdk-checkup-traffic-gen:v0.3.0"
        spec.param.vmUnderTestContainerDiskImage: "quay.io/kiagnose/kubevirt-dpdk-checkup-vm:v0.3.0"
      EOF
      configmap/dpdk-checkup-config created
      
      Note: You might need to change `networkAttachmentDefinitionName` if you use another net-attach-def.
      
      4.
      Run the job by applying the attached `7-dpdk-checkup-job.yaml`, then wait for the job to complete:
      
      $ oc apply -f 7-dpdk-checkup-job.yaml
      job.batch/dpdk-checkup created
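      
      The wait in step 4, and the check of the result afterwards, could be scripted roughly as follows. This is only a sketch: the namespace, Job and ConfigMap names are the ones used in the steps above, the 25m wait is an assumed margin over the 20m `spec.timeout`, and it assumes the checkup writes `status.failureReason` back into the same `dpdk-checkup-config` ConfigMap.

```shell
#!/usr/bin/env bash
# Sketch: wait for the checkup Job to finish, then print the failure reason
# (if any) recorded in the checkup ConfigMap.
# Assumptions: results are written to the same ConfigMap that holds the
# spec.* keys; 25m is an arbitrary margin over the 20m spec.timeout.

NS="dpdk-checkup-ns"
JOB="dpdk-checkup"
CM="dpdk-checkup-config"

wait_for_checkup() {
  # Block until the Job reports the Complete condition or the wait times out.
  oc wait --for=condition=complete "job/${JOB}" -n "${NS}" --timeout=25m
}

failure_reason() {
  # Dots inside a ConfigMap key have to be escaped in an oc/kubectl JSONPath.
  # Prints an empty string when the key is absent.
  oc get configmap "${CM}" -n "${NS}" \
    -o jsonpath='{.data.status\.failureReason}'
}
```

      Typical use would be `wait_for_checkup && failure_reason`. If the Job ends in the Failed condition rather than Complete, the wait simply runs into its timeout, so `failure_reason` may need to be called on its own in that case.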
      

      Actual results:

      Over several attempts, some runs succeed and some fail with the following failure reason in the output ConfigMap:
      
        status.failureReason: 'failed to login to VMI "dpdk-checkup-ns/vmi-under-test-hgm62":
          expect: timer expired after 60 seconds'
      

      Expected results:

      All attempts should end successfully.
      

      Additional info:

      1.
      Of the 10 attempts I ran, 6 ended with this failure.
      
      2.
      In the example above, the failure reason reports a failed login to the VMI under test. That is the case in some of the failed runs; in others, the same failure message is reported for the traffic-gen VMI.
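      
      When collecting results from many attempts, it may help to pull the VMI name out of each failure reason so the two cases can be told apart. The helpers below are a hypothetical sketch, not part of the checkup. The `vmi-under-test-` prefix is taken from the message quoted in this report; the traffic generator's naming is not shown here, so any other name is only reported as "other".

```shell
#!/usr/bin/env bash
# Sketch: extract the VMI name from a failureReason string and report
# whether it refers to the VMI under test.

failed_vmi() {
  # Prints "<namespace>/<name>" from a message of the form:
  #   failed to login to VMI "<namespace>/<name>": ...
  # Prints nothing when the message has a different shape.
  printf '%s\n' "$1" | sed -n 's/.*failed to login to VMI "\([^"]*\)".*/\1/p'
}

classify_vmi() {
  # Strip the namespace part, then match on the name prefix.
  case "${1##*/}" in
    vmi-under-test-*) echo "vmi-under-test" ;;
    "")               echo "unknown" ;;
    *)                echo "other" ;;
  esac
}
```

      For example, `classify_vmi "$(failed_vmi "$reason")"`, where `$reason` holds the `status.failureReason` value read from the ConfigMap.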
       

        1. 5-dpdk-checkup-resources.yaml
          1 kB
          Yossi Segev
        2. 7-dpdk-checkup-job.yaml
          1.0 kB
          Yossi Segev

              ralavi@redhat.com Ram Lavi
              ysegev@redhat.com Yossi Segev
              Yossi Segev Yossi Segev