Type: Bug
Status: CLOSED
Resolution: Done-Errata
Priority: Major
Severity: High
Sprint: CNV Virtualization Sprint 237
+++ This bug was initially created as a clone of Bug #2171395 +++
Description of problem:
The evacuation controller hits an out-of-bounds slice access, which causes virt-controller to panic. The index calculations in the evacuation controller are not protected against negative values, so the computed index can become negative.
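For reference, Go panics as soon as a slice expression receives a negative (or too large) bound, which is the class of crash described here. The sketch below is only an illustration of that pattern under assumed names (selectCandidatesUnsafe, selectCandidatesSafe, limit, inFlight are hypothetical, not the actual evacuation-controller code): the unguarded form panics when the computed count goes negative, while clamping the count first avoids it.

package main

import "fmt"

// selectCandidatesUnsafe mirrors the unguarded pattern described above:
// it slices the candidate list by a computed count without checking the
// bounds, so a negative result panics with "slice bounds out of range".
func selectCandidatesUnsafe(candidates []string, limit, inFlight int) []string {
	return candidates[:limit-inFlight]
}

// selectCandidatesSafe clamps the computed count into [0, len(candidates)]
// before slicing, which is the kind of protection the report says is missing.
func selectCandidatesSafe(candidates []string, limit, inFlight int) []string {
	n := limit - inFlight
	if n < 0 {
		n = 0
	}
	if n > len(candidates) {
		n = len(candidates)
	}
	return candidates[:n]
}

func main() {
	vmis := []string{"vm-fedora1", "vm-fedora2", "vm-fedora3", "vm-fedora4", "vm-fedora5"}
	// With more in-flight migrations than the limit allows, the unsafe
	// version would panic; the clamped version just returns no candidates.
	fmt.Println(selectCandidatesSafe(vmis, 2, 5)) // prints []
}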
Version-Release number of selected component (if applicable):
4.13.0
How reproducible:
100%
Steps to Reproduce:
0. IMPORTANT: make sure that the KubeVirt control-plane components are deployed on infra nodes, not on worker nodes. The reproduction involves a node drain, and the controllers must not be evicted while it runs.
1. In the KubeVirt configuration, set:
spec.configuration.migrations.parallelMigrationsPerCluster: 200
spec.configuration.migrations.parallelOutboundMigrationsPerNode: 100
2. Add a custom label to one of the worker nodes, for example "type=worker001"
3. Create 5 migratable VMIs with a nodeSelector of "type=worker001"
4. Drain the worker node with the label "type=worker001"
5. Make sure you see 5 pending VM instance migrations with "oc get vmim"
6. Wait 4-5 minutes, then observe the virt-controller pod status (the failure mode is sketched below)
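On an unfixed build, step 6 is where the crash shows up: virt-controller restarts or goes into CrashLoopBackOff with a Go runtime panic for an out-of-range slice bound. The standalone snippet below is only a demo of that runtime failure mode; the names and numbers are made up and unrelated to the controller's real calculation.

package main

// Demonstrates only the runtime failure mode the report describes:
// a slice bound that goes negative makes the Go runtime panic.
func main() {
	pending := []string{"vm-fedora1", "vm-fedora2", "vm-fedora3", "vm-fedora4", "vm-fedora5"}
	bound := len(pending) - 8 // an unprotected calculation that ends up negative
	_ = pending[:bound]       // panic: runtime error: slice bounds out of range [:-3]
}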
Actual results:
virt-controller hits the out-of-bounds slice access and panics.
Expected results:
No panic; the evacuation migrations are processed normally.
Additional info:
— Additional comment from zhe peng on 2023-03-30 07:19:26 UTC —
Tested with build: CNV-v4.13.0.rhel9-1884
Steps:
1. Check that the control-plane components are not on worker nodes:
$ oc get nodes
NAME                                 STATUS   ROLES                  AGE   VERSION
c01-zpeng-413-dff6b-master-0         Ready    control-plane,master   43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-master-1         Ready    control-plane,master   43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-master-2         Ready    control-plane,master   43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-worker-0-fdmgv   Ready    worker                 43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-worker-0-j6bj6   Ready    worker                 43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-worker-0-jfjgb   Ready    worker                 43h   v1.26.2+dc93b13
2. Set the migration config in the KubeVirt CR:
migrations:
  allowAutoConverge: false
  allowPostCopy: false
  completionTimeoutPerGiB: 800
  parallelMigrationsPerCluster: 200
  parallelOutboundMigrationsPerNode: 100
  progressTimeout: 150
3. Add the label to a worker node:
$ oc label node c01-zpeng-413-dff6b-worker-0-fdmgv type=worker001
node/c01-zpeng-413-dff6b-worker-0-fdmgv labeled
4. Create 5 VMs and add the nodeSelector:
spec:
  nodeSelector:
    type: worker001
$ oc get vmi
NAME         AGE     PHASE     IP             NODENAME                             READY
vm-fedora1   17m     Running   10.131.0.231   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora2   15m     Running   10.131.0.232   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora3   11m     Running   10.131.0.234   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora4   8m13s   Running   10.131.0.235   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora5   4m37s   Running   10.131.0.236   c01-zpeng-413-dff6b-worker-0-fdmgv   True
5. Drain the worker node with the label:
$ oc adm cordon c01-zpeng-413-dff6b-worker-0-fdmgv
node/c01-zpeng-413-dff6b-worker-0-fdmgv cordoned
$ oc adm drain c01-zpeng-413-dff6b-worker-0-fdmgv --ignore-daemonsets=true --delete-emptydir-data=true
6. Make sure there are 5 pending migrations:
$ oc get vmim
NAME                        PHASE        VMI
kubevirt-evacuation-2hvw6   Scheduling   vm-fedora1
kubevirt-evacuation-5tfgc   Scheduling   vm-fedora2
kubevirt-evacuation-6zkst   Scheduling   vm-fedora5
kubevirt-evacuation-gzwbx   Scheduling   vm-fedora4
kubevirt-evacuation-h2tlv   Scheduling   vm-fedora3
7. Wait more than 10 minutes, then observe the virt-controller pod status:
$ oc get pods -n openshift-cnv | grep virt-controller
virt-controller-5cc6f78f8f-nvd59   1/1   Running   0   14m
virt-controller-5cc6f78f8f-s2wdb   1/1   Running   0   43h
No panic happened.
Moving to VERIFIED.
— Additional comment from errata-xmlrpc on 2023-04-01 05:33:05 UTC —
This bug has been added to advisory RHEA-2022:101182 by CPaaS, owned by Greg Allen (contra-dev/pipeline@REDHAT.COM)