OpenShift Virtualization / CNV-23732

[2156902] VM latency checkup - Checkup not performing a teardown in case of setup failure


      Description of problem:
      When a checkup encounters a setup failure, the resources created by the job (VMIs and their virt-launcher pods) are not deleted.

      Version-Release number of selected component (if applicable):
      kubevirt-hyperconverged-operator.v4.12.0 (OpenShift Virtualization 4.12.0, replaces kubevirt-hyperconverged-operator.v4.11.1, phase: Succeeded)

      Client Version: 4.12.0-rc.6
      Kustomize Version: v4.5.7
      Server Version: 4.12.0-rc.6
      Kubernetes Version: v1.25.4+77bec7a

      How reproducible:
      Create a checkup ConfigMap with a nonexistent node specified as the source node. The source virt-launcher pod stays in Pending and never reaches Running, so the actual checkup never starts.
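
      A quick way to watch the stuck source pod while the job runs (a sketch, assuming the test-latency namespace created in step 1 below):
      oc get pods -n test-latency -w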

      Steps to Reproduce:
      1. Create a Namespace:
      oc new-project test-latency

      2. Create a bridge with this YAML (a verification sketch follows the manifest):
      apiVersion: nmstate.io/v1
      kind: NodeNetworkConfigurationPolicy
      metadata:
        name: br10
      spec:
        desiredState:
          interfaces:
          - bridge:
              options:
                stp:
                  enabled: false
              port:
              - name: ens9
            ipv4:
              auto-dns: true
              dhcp: false
              enabled: false
            ipv6:
              auto-dns: true
              autoconf: false
              dhcp: false
              enabled: false
            name: br10
            state: up
            type: linux-bridge
        nodeSelector:
          node-role.kubernetes.io/worker: ''
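
      A quick way to confirm the policy was applied on the workers (a sketch; assumes kubernetes-nmstate is installed and uses the resource names from the manifest above):
      oc get nodenetworkconfigurationpolicy br10
      oc get nodenetworkconfigurationenactments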

      3. Create a NAD with this YAML (a quick check follows the manifest):
      apiVersion: k8s.cni.cncf.io/v1
      kind: NetworkAttachmentDefinition
      metadata:
        name: bridge-network-nad
      spec:
        config: |
          {
            "cniVersion": "0.3.1",
            "name": "br10",
            "plugins": [
              { "type": "cnv-bridge", "bridge": "br10" }
            ]
          }
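
      To confirm the NAD exists (a sketch, assuming it was created in the current test-latency project):
      oc get network-attachment-definitions bridge-network-nad -n test-latency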

      4. Create a ServiceAccount, Roles, and RoleBindings (a sanity check follows the manifest):
      cat <<EOF | kubectl apply -f -
      ---
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: vm-latency-checkup-sa
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: Role
      metadata:
        name: kubevirt-vm-latency-checker
      rules:
      - apiGroups: ["kubevirt.io"]
        resources: ["virtualmachineinstances"]
        verbs: ["get", "create", "delete"]
      - apiGroups: ["subresources.kubevirt.io"]
        resources: ["virtualmachineinstances/console"]
        verbs: ["get"]
      - apiGroups: ["k8s.cni.cncf.io"]
        resources: ["network-attachment-definitions"]
        verbs: ["get"]
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        name: kubevirt-vm-latency-checker
      subjects:
      - kind: ServiceAccount
        name: vm-latency-checkup-sa
      roleRef:
        kind: Role
        name: kubevirt-vm-latency-checker
        apiGroup: rbac.authorization.k8s.io
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: Role
      metadata:
        name: kiagnose-configmap-access
      rules:
      - apiGroups: [ "" ]
        resources: [ "configmaps" ]
        verbs: ["get", "update"]
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        name: kiagnose-configmap-access
      subjects:
      - kind: ServiceAccount
        name: vm-latency-checkup-sa
      roleRef:
        kind: Role
        name: kiagnose-configmap-access
        apiGroup: rbac.authorization.k8s.io
      EOF
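
      A quick sanity check that the RBAC objects were created (a sketch, using the names from the manifest above):
      oc get serviceaccount,role,rolebinding -n test-latency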

      5. Create the checkup ConfigMap with the "spec.param.max_desired_latency_milliseconds" field set to 0 and "spec.param.source_node" set to a nonexistent node (a note on reading results follows the manifest):
      cat <<EOF | kubectl apply -f -
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: kubevirt-vm-latency-checkup-config
      data:
        spec.timeout: 5m
        spec.param.network_attachment_definition_namespace: "manual-latency-check"
        spec.param.network_attachment_definition_name: "bridge-network-nad"
        spec.param.max_desired_latency_milliseconds: "0"
        spec.param.sample_duration_seconds: "5"
        spec.param.source_node: non-existent-node
        spec.param.target_node: cnv-qe-14.cnvqe.lab.eng.rdu2.redhat.com
      EOF
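
      The checkup is expected to write its results back into this ConfigMap when the job finishes, so it can be inspected afterwards (a sketch, assuming the kiagnose convention of status.* keys in the same ConfigMap):
      oc get configmap kubevirt-vm-latency-checkup-config -n test-latency -o yaml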

      6. Create a Job (its log can be followed as sketched after the manifest):
      cat <<EOF | kubectl apply -f -
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: kubevirt-vm-latency-checkup
      spec:
        backoffLimit: 0
        template:
          spec:
            serviceAccountName: vm-latency-checkup-sa
            restartPolicy: Never
            containers:
            - name: vm-latency-checkup
              image: brew.registry.redhat.io/rh-osbs/container-native-virtualization-vm-network-latency-checkup:v4.12.0
              securityContext:
                runAsUser: 1000
                allowPrivilegeEscalation: false
                capabilities:
                  drop: ["ALL"]
                runAsNonRoot: true
                seccompProfile:
                  type: "RuntimeDefault"
              env:
              - name: CONFIGMAP_NAMESPACE
                value: test-latency
              - name: CONFIGMAP_NAME
                value: kubevirt-vm-latency-checkup-config
      EOF
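
      The checkup progress can be followed from the job's pod log (a sketch, using the Job name from the manifest above):
      oc logs -n test-latency job/kubevirt-vm-latency-checkup -f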

      Actual results:
      When the job is deleted, the pods and VMIs are not deleted (a manual cleanup sketch follows the output):
      oc get all
      NAME                                           READY   STATUS    RESTARTS   AGE
      pod/latency-nonexistent-node-job-qt4wk         0/1     Error     0          74m
      pod/virt-launcher-latency-check-source-4fqgk   0/2     Pending   0          74m
      pod/virt-launcher-latency-check-target-smj9r   2/2     Running   0          74m

      NAME                                     COMPLETIONS   DURATION   AGE
      job.batch/latency-nonexistent-node-job   0/1           74m        74m

      NAME                                                      AGE   PHASE        IP               NODENAME                                  READY
      virtualmachineinstance.kubevirt.io/latency-check-source   74m   Scheduling                                                              False
      virtualmachineinstance.kubevirt.io/latency-check-target   74m   Running      192.168.100.20   cnv-qe-14.cnvqe.lab.eng.rdu2.redhat.com   True
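
      Until the checkup cleans up after itself, the leftover VMIs can be removed manually (a workaround sketch, using the VMI names from the output above; deleting a VMI also removes its virt-launcher pod):
      oc delete vmi latency-check-source latency-check-target -n test-latency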

      Expected results:
      All the resources created by the Job (VMIs and their virt-launcher pods) are deleted when the Job is deleted, even if the checkup setup failed.
