Type: Bug
Resolution: Unresolved
Component: Quality / Stability / Reliability
Description of problem:
The virt-launcher pod fails in MapVolume.SetUpDevice after a node outage (created by introducing 60 seconds of network latency for 15 minutes), which leaves the VMI stuck in the Scheduling state. I was able to recreate the issue 2 out of 10 times.
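For triage, a minimal sketch of how the underlying krbd failure can be confirmed on the affected node (the node name e09-h12-000-r660 is taken from the crush_location in the events below; the exact grep pattern is only a suggestion). Error 108 is ESHUTDOWN ("Cannot send after transport endpoint shutdown"), which suggests the kernel RBD client's session to the Ceph cluster was torn down and not re-established after the injected latency cleared:

$ oc debug node/e09-h12-000-r660 -- chroot /host dmesg | grep -iE 'libceph|rbd'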
Version-Release number of selected component (if applicable):
NAMESPACE               NAME                                        DISPLAY                            VERSION         REPLACES                                    PHASE
openshift-cnv           kubevirt-hyperconverged-operator.v4.18.13   OpenShift Virtualization           4.18.13         kubevirt-hyperconverged-operator.v4.18.11   Succeeded
openshift-ovirt-infra   node-healthcheck-operator.v0.10.0           Node Health Check Operator         0.10.0          node-healthcheck-operator.v0.9.1            Succeeded
openshift-storage       cephcsi-operator.v4.18.11-rhodf             CephCSI operator                   4.18.11-rhodf   cephcsi-operator.v4.18.10-rhodf             Succeeded
openshift-storage       mcg-operator.v4.18.11-rhodf                 NooBaa Operator                    4.18.11-rhodf   mcg-operator.v4.18.10-rhodf                 Succeeded
openshift-storage       ocs-client-operator.v4.18.11-rhodf          OpenShift Data Foundation Client   4.18.11-rhodf   ocs-client-operator.v4.18.10-rhodf          Succeeded
openshift-storage       ocs-operator.v4.18.11-rhodf                 OpenShift Container Storage        4.18.11-rhodf   ocs-operator.v4.18.10-rhodf                 Succeeded
openshift-storage       odf-csi-addons-operator.v4.18.11-rhodf      CSI Addons                         4.18.11-rhodf   odf-csi-addons-operator.v4.18.10-rhodf      Succeeded
openshift-storage       odf-dependencies.v4.18.11-rhodf             Data Foundation Dependencies       4.18.11-rhodf   odf-dependencies.v4.18.10-rhodf             Succeeded
openshift-storage       odf-operator.v4.18.11-rhodf                 OpenShift Data Foundation          4.18.11-rhodf   odf-operator.v4.18.10-rhodf                 Succeeded
openshift-storage       odf-prometheus-operator.v4.18.11-rhodf      Prometheus Operator                4.18.11-rhodf   odf-prometheus-operator.v4.18.10-rhodf      Succeeded
openshift-storage       recipe.v4.18.11-rhodf                       Recipe                             4.18.11-rhodf   recipe.v4.18.10-rhodf                       Succeeded
openshift-storage       rook-ceph-operator.v4.18.11-rhodf           Rook-Ceph                          4.18.11-rhodf   rook-ceph-operator.v4.18.10-rhodf           Succeeded
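The listing above is the set of installed ClusterServiceVersions; assuming cluster-admin access with oc, it can be regenerated with something like:

$ oc get csv -n openshift-cnv
$ oc get csv -n openshift-storage
$ oc get csv -A    # all namespaces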
How reproducible:
2 out of 10 times.

1. Introduced 60s of network latency for 15 minutes on 2 worker nodes in the same ODF zone (with one worker hosting the MON pod).
2. The 2 worker nodes become NotReady.
3. Some of the VMs on those worker nodes get stuck in the Scheduling state due to the events below.

Events on the affected virt-launcher pod:

Type     Reason           Age                    From     Message
----     ------           ----                   ----     -------
Warning  FailedMapVolume  4m9s (x121 over 10h)   kubelet  MapVolume.SetUpDevice failed for volume "pvc-b7db50ea-0637-4653-aec7-05d40201b4d8" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 108) occurred while running rbd args: [--id csi-rbd-node -m 172.30.137.156:3300,172.30.245.95:3300,172.30.86.193:3300 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a7c2f767-3838-4efa-9cf3-b23e02ef619b --device-type krbd --options noudev --options read_from_replica=localize,crush_location=host:e09-h12-000-r660|rack:rack1], rbd error output: rbd: sysfs write failed
rbd: map failed: (108) Cannot send after transport endpoint shutdown
Warning  FailedMapVolume  3s (x187 over 10h)     kubelet  MapVolume.SetUpDevice failed for volume "pvc-b7db50ea-0637-4653-aec7-05d40201b4d8" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 108) occurred while running rbd args: [--id csi-rbd-node -m 172.30.137.156:3300,172.30.245.95:3300,172.30.86.193:3300 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a7c2f767-3838-4efa-9cf3-b23e02ef619b --device-type krbd --options noudev --options read_from_replica=localize,crush_location=rack:rack1|host:e09-h12-000-r660], rbd error output: rbd: sysfs write failed
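These events can be retrieved from the stuck virt-launcher pod (the pod and namespace names below are taken from the Actual results section and will differ per run):

$ oc describe pod virt-launcher-clone-vm-0-17-c9qhz -n virt-clone-clones
$ oc get events -n virt-clone-clones --field-selector reason=FailedMapVolume --sort-by=.lastTimestamp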
Steps to Reproduce:
1. Introduced 60s of network latency for 15 minutes on 2 worker nodes in the same ODF zone (with one worker hosting the MON pod), using krkn-hub network-chaos:

podman run --rm \
  -e LABEL_SELECTOR="chaos=odf" \
  -e INSTANCE_COUNT=1 \
  -e DURATION=900 \
  -e TRAFFIC_TYPE=egress \
  -e EGRESS='{latency: 60000ms}' \
  -e KUBECONFIG=/tmp/config \
  -e KRKN_KUBE_CONFIG=/tmp/config \
  -e DISTRIBUTION='openshift' \
  --net=host \
  -v /tmp/config:/tmp/config:Z \
  quay.io/krkn-chaos/krkn-hub:network-chaos
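To confirm the fault injection took effect (a sketch: chaos=odf is the LABEL_SELECTOR used above, and the rook-ceph-mon pod placement identifies the MON-hosting worker):

$ oc get nodes -l chaos=odf -w                                     # watch the targeted workers go NotReady
$ oc get pods -n openshift-storage -o wide | grep rook-ceph-mon    # confirm which worker hosts a MON pod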
Actual results:
Some of the VMs are stuck in the Scheduling state:

virt-clone-clones   clone-vm-0-17    13h   Scheduling   False
virt-clone-clones   clone-vm-0-296   13h   Scheduling   False
virt-clone-clones   clone-vm-0-328   13h   Scheduling   False
virt-clone-clones   clone-vm-0-40    13h   Scheduling   False

virt-launcher-clone-vm-0-17-c9qhz    0/1   ContainerCreating   0   13h
virt-launcher-clone-vm-0-296-dzg4b   0/1   ContainerCreating   0   13h
virt-launcher-clone-vm-0-328-rls4j   0/1   ContainerCreating   0   13h
virt-launcher-clone-vm-0-40-897zb    0/1   ContainerCreating   0   13h
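The listing above combines VMI and virt-launcher pod status; assuming the virt-clone-clones namespace, it corresponds roughly to:

$ oc get vmi -A | grep Scheduling
$ oc get pods -n virt-clone-clones | grep virt-launcher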
Expected results:
All the VMs (Fedora VMs) should be in the Running state.
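A quick check that the cluster has fully recovered (a sketch; any VMI still not in the Running phase is suspect):

$ oc get vmi -A | grep -v Running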
Additional info: