OpenShift Virtualization / CNV-27938

[2185068] virt-controller crashes because of out-of-bound slice access in evacuation controller


    • Sprint: CNV Virtualization Sprint 237
    • Priority: High

      +++ This bug was initially created as a clone of Bug #2171395 +++

      Description of problem:
      The evacuation controller hits an out-of-bound slice access, which makes virt-controller panic. The index calculations in the evacuation controller are not protected against negative values, so the computed index can become negative.
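      For illustration, here is a minimal, self-contained Go sketch of the failure mode. The names (selectCandidates, freeSlots, clusterLimit) are invented for this example and are not the actual KubeVirt identifiers; the real controller's arithmetic differs in detail.

      package main

      import "fmt"

      // selectCandidates mimics the kind of candidate selection the
      // evacuation controller performs: trim the candidate list down
      // to the number of free migration slots.
      func selectCandidates(candidates []string, clusterLimit, activeMigrations int) []string {
          // Unguarded arithmetic: when activeMigrations exceeds
          // clusterLimit, freeSlots is negative and the slice
          // expression below panics at runtime.
          freeSlots := clusterLimit - activeMigrations
          return candidates[:freeSlots]
      }

      func main() {
          vmis := []string{"vm-fedora1", "vm-fedora2", "vm-fedora3"}
          // 10 migrations already active against a cluster limit of 5:
          fmt.Println(selectCandidates(vmis, 5, 10))
          // panic: runtime error: slice bounds out of range [:-5]
      }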

      Version-Release number of selected component (if applicable):
      4.13.0

      How reproducible:
      100%

      Steps to Reproduce:
      0. IMPORTANT: make sure that the KubeVirt control-plane components are deployed on infra nodes, not on worker nodes. The reproduction involves a node drain, and we do not want the controllers to be evicted during it.
      1. In the KubeVirt configuration, set:
      spec.configuration.migrations.parallelMigrationsPerCluster: 200
      spec.configuration.migrations.parallelOutboundMigrationsPerNode: 100

      2. Add custom label on one of the worker nodes, for example "type=worker001"
      3. Create 5 migratable VMIs with nodeSelector of "type=worker001"
      4. Drain the worker node with the label "type=worker001"
      5. Make sure you see 5 pending VM instance migrations with "oc get vmim"
      6. Wait 4-5 minutes and observe the status of the virt-controller pods

      Actual results:
      virt-controller panics because of the out-of-bound slice access in the evacuation controller.

      Expected results:
      virt-controller does not panic; the pending evacuation migrations are handled normally.

      Additional info:
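      The panic can be avoided by clamping the computed bound into the valid range [0, len(candidates)] before slicing. Below is a hedged sketch of that kind of guard, using the same invented names as the sketch above; this is not the literal upstream patch.

      func selectCandidatesSafe(candidates []string, clusterLimit, activeMigrations int) []string {
          freeSlots := clusterLimit - activeMigrations
          if freeSlots < 0 {
              freeSlots = 0 // never slice with a negative bound
          }
          if freeSlots > len(candidates) {
              freeSlots = len(candidates) // never slice past the end
          }
          return candidates[:freeSlots]
      }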

      — Additional comment from zhe peng on 2023-03-30 07:19:26 UTC —

      Tested with build: CNV-v4.13.0.rhel9-1884

      Steps:
      1. Check that the control-plane components are not on the worker nodes:
      $ oc get nodes
      NAME                                 STATUS   ROLES                  AGE   VERSION
      c01-zpeng-413-dff6b-master-0         Ready    control-plane,master   43h   v1.26.2+dc93b13
      c01-zpeng-413-dff6b-master-1         Ready    control-plane,master   43h   v1.26.2+dc93b13
      c01-zpeng-413-dff6b-master-2         Ready    control-plane,master   43h   v1.26.2+dc93b13
      c01-zpeng-413-dff6b-worker-0-fdmgv   Ready    worker                 43h   v1.26.2+dc93b13
      c01-zpeng-413-dff6b-worker-0-j6bj6   Ready    worker                 43h   v1.26.2+dc93b13
      c01-zpeng-413-dff6b-worker-0-jfjgb   Ready    worker                 43h   v1.26.2+dc93b13

      2. Set the migration config in the kubevirt CR:
      migrations:
        allowAutoConverge: false
        allowPostCopy: false
        completionTimeoutPerGiB: 800
        parallelMigrationsPerCluster: 200
        parallelOutboundMigrationsPerNode: 100
        progressTimeout: 150

      3. Add the label to one worker node:
      $ oc label node c01-zpeng-413-dff6b-worker-0-fdmgv type=worker001
      node/c01-zpeng-413-dff6b-worker-0-fdmgv labeled

      4. Create 5 VMs and add a nodeSelector:
      spec:
        nodeSelector:
          type: worker001

      $ oc get vmi
      NAME         AGE     PHASE     IP             NODENAME                             READY
      vm-fedora1   17m     Running   10.131.0.231   c01-zpeng-413-dff6b-worker-0-fdmgv   True
      vm-fedora2   15m     Running   10.131.0.232   c01-zpeng-413-dff6b-worker-0-fdmgv   True
      vm-fedora3   11m     Running   10.131.0.234   c01-zpeng-413-dff6b-worker-0-fdmgv   True
      vm-fedora4   8m13s   Running   10.131.0.235   c01-zpeng-413-dff6b-worker-0-fdmgv   True
      vm-fedora5   4m37s   Running   10.131.0.236   c01-zpeng-413-dff6b-worker-0-fdmgv   True

      5. Drain the worker node with the label:
      $ oc adm cordon c01-zpeng-413-dff6b-worker-0-fdmgv
      node/c01-zpeng-413-dff6b-worker-0-fdmgv cordoned

      $ oc adm drain c01-zpeng-413-dff6b-worker-0-fdmgv --ignore-daemonsets=true --delete-emptydir-data=true

      6. Make sure there are 5 pending migrations:
      $ oc get vmim
      NAME                        PHASE        VMI
      kubevirt-evacuation-2hvw6   Scheduling   vm-fedora1
      kubevirt-evacuation-5tfgc   Scheduling   vm-fedora2
      kubevirt-evacuation-6zkst   Scheduling   vm-fedora5
      kubevirt-evacuation-gzwbx   Scheduling   vm-fedora4
      kubevirt-evacuation-h2tlv   Scheduling   vm-fedora3

      Wait more than 10 minutes, then observe the status of the virt-controller pods:
      $ oc get pods -n openshift-cnv | grep virt-controller
      virt-controller-5cc6f78f8f-nvd59   1/1   Running   0   14m
      virt-controller-5cc6f78f8f-s2wdb   1/1   Running   0   43h

      No panic happened.
      Moving to VERIFIED.

      — Additional comment from errata-xmlrpc on 2023-04-01 05:33:05 UTC —

      This bug has been added to advisory RHEA-2022:101182 by CPaaS, owned by Greg Allen (contra-dev/pipeline@REDHAT.COM)

              Assignee: Luboslav Pivarc (lpivarc)
              Reporter: Robert Krawitz (robertkrawitz)