Type: Bug
Status: CLOSED
Resolution: Done-Errata
Priority: Major
Severity: High
Sprint: CNV Virtualization Sprint 237
+++ This bug was initially created as a clone of Bug #2171395 +++
Description of problem:
The evacuation controller hits an out-of-bounds slice access, which causes virt-controller to panic. The index calculations in the evacuation controller are not protected against negative values, so the computed index can become negative.
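For reference, Go panics as soon as a slice expression receives a negative (or too large) bound, which is the class of crash described here. The sketch below is only an illustration of that pattern under assumed names (selectCandidatesUnsafe, selectCandidatesSafe, limit, inFlight are hypothetical, not the actual evacuation-controller code): the unguarded form panics when the computed count goes negative, while clamping the count first avoids it.

package main

import "fmt"

// selectCandidatesUnsafe mirrors the unguarded pattern described above:
// it slices the candidate list by a computed count without checking the
// bounds, so a negative result panics with "slice bounds out of range".
func selectCandidatesUnsafe(candidates []string, limit, inFlight int) []string {
	return candidates[:limit-inFlight]
}

// selectCandidatesSafe clamps the computed count into [0, len(candidates)]
// before slicing, which is the kind of protection the report says is missing.
func selectCandidatesSafe(candidates []string, limit, inFlight int) []string {
	n := limit - inFlight
	if n < 0 {
		n = 0
	}
	if n > len(candidates) {
		n = len(candidates)
	}
	return candidates[:n]
}

func main() {
	vmis := []string{"vm-fedora1", "vm-fedora2", "vm-fedora3", "vm-fedora4", "vm-fedora5"}
	// With more in-flight migrations than the limit allows, the unsafe
	// version would panic; the clamped version just returns no candidates.
	fmt.Println(selectCandidatesSafe(vmis, 2, 5)) // prints []
}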
Version-Release number of selected component (if applicable):
4.13.0
How reproducible:
100%
Steps to Reproduce:
0. IMPORTANT: make sure that the KubeVirt control-plane components are deployed on infra nodes, not on worker nodes. The reproduction involves a node drain, and the controllers must not be evicted while it runs.
1. In the KubeVirt configuration, set:
spec.configuration.migrations.parallelMigrationsPerCluster: 200
spec.configuration.migrations.parallelOutboundMigrationsPerNode: 100
2. Add a custom label to one of the worker nodes, for example "type=worker001"
3. Create 5 migratable VMIs with a nodeSelector of "type=worker001"
4. Drain the worker node with the label "type=worker001"
5. Make sure you see 5 pending VM instance migrations with "oc get vmim"
6. Wait 4-5 minutes, then observe the virt-controller pod status (the failure mode is sketched below)
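On an unfixed build, step 6 is where the crash shows up: virt-controller restarts or goes into CrashLoopBackOff with a Go runtime panic for an out-of-range slice bound. The standalone snippet below is only a demo of that runtime failure mode; the names and numbers are made up and unrelated to the controller's real calculation.

package main

// Demonstrates only the runtime failure mode the report describes:
// a slice bound that goes negative makes the Go runtime panic.
func main() {
	pending := []string{"vm-fedora1", "vm-fedora2", "vm-fedora3", "vm-fedora4", "vm-fedora5"}
	bound := len(pending) - 8 // an unprotected calculation that ends up negative
	_ = pending[:bound]       // panic: runtime error: slice bounds out of range [:-3]
}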
Actual results:
virt-controller hits the out-of-bounds slice access and panics.
Expected results:
No panic; the evacuation migrations are processed normally.
Additional info:
— Additional comment from zhe peng on 2023-03-30 07:19:26 UTC —
Tested with build: CNV-v4.13.0.rhel9-1884
Steps:
1. Check that the control-plane components are not on worker nodes:
$ oc get nodes
NAME                                 STATUS   ROLES                  AGE   VERSION
c01-zpeng-413-dff6b-master-0         Ready    control-plane,master   43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-master-1         Ready    control-plane,master   43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-master-2         Ready    control-plane,master   43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-worker-0-fdmgv   Ready    worker                 43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-worker-0-j6bj6   Ready    worker                 43h   v1.26.2+dc93b13
c01-zpeng-413-dff6b-worker-0-jfjgb   Ready    worker                 43h   v1.26.2+dc93b13
2. Set the migration config in the KubeVirt CR:
migrations:
  allowAutoConverge: false
  allowPostCopy: false
  completionTimeoutPerGiB: 800
  parallelMigrationsPerCluster: 200
  parallelOutboundMigrationsPerNode: 100
  progressTimeout: 150
3. Add the label to a worker node:
$ oc label node c01-zpeng-413-dff6b-worker-0-fdmgv type=worker001
node/c01-zpeng-413-dff6b-worker-0-fdmgv labeled
4. Create 5 VMs and add the nodeSelector:
spec:
  nodeSelector:
    type: worker001
$ oc get vmi
NAME         AGE     PHASE     IP             NODENAME                             READY
vm-fedora1   17m     Running   10.131.0.231   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora2   15m     Running   10.131.0.232   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora3   11m     Running   10.131.0.234   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora4   8m13s   Running   10.131.0.235   c01-zpeng-413-dff6b-worker-0-fdmgv   True
vm-fedora5   4m37s   Running   10.131.0.236   c01-zpeng-413-dff6b-worker-0-fdmgv   True
5. Drain the worker node with the label:
$ oc adm cordon c01-zpeng-413-dff6b-worker-0-fdmgv
node/c01-zpeng-413-dff6b-worker-0-fdmgv cordoned
$ oc adm drain c01-zpeng-413-dff6b-worker-0-fdmgv --ignore-daemonsets=true --delete-emptydir-data=true
6. Make sure there are 5 pending migrations:
$ oc get vmim
NAME                        PHASE        VMI
kubevirt-evacuation-2hvw6   Scheduling   vm-fedora1
kubevirt-evacuation-5tfgc   Scheduling   vm-fedora2
kubevirt-evacuation-6zkst   Scheduling   vm-fedora5
kubevirt-evacuation-gzwbx   Scheduling   vm-fedora4
kubevirt-evacuation-h2tlv   Scheduling   vm-fedora3
7. Wait more than 10 minutes, then observe the virt-controller pod status:
$ oc get pods -n openshift-cnv | grep virt-controller
virt-controller-5cc6f78f8f-nvd59   1/1   Running   0   14m
virt-controller-5cc6f78f8f-s2wdb   1/1   Running   0   43h
No panic happened.
Moving to VERIFIED.
— Additional comment from errata-xmlrpc on 2023-04-01 05:33:05 UTC —
This bug has been added to advisory RHEA-2022:101182 by CPaaS, owned by Greg Allen (contra-dev/pipeline@REDHAT.COM)