OpenShift Virtualization / CNV-70055

virt-launcher pod fails to MapVolume after node outage


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Storage Platform
    • Quality / Stability / Reliability

      Description of problem:

      The virt-launcher pod fails to MapVolume after a node outage (induced by introducing 60 seconds of network latency for 15 minutes), which leaves the VMI stuck in the Scheduling state.
      I was able to reproduce the issue 2 out of 10 times.
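      A quick way to spot the affected VMIs and their pending virt-launcher pods (namespace and pod names taken from the Actual results section below; adjust as needed):

      # VMIs that are not yet Running, and their pending virt-launcher pods
      oc get vmi -A | grep Scheduling
      oc get pods -n virt-clone-clones | grep virt-launcher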
      

      Version-Release number of selected component (if applicable):

      openshift-cnv                                      kubevirt-hyperconverged-operator.v4.18.13         OpenShift Virtualization            4.18.13               kubevirt-hyperconverged-operator.v4.18.11         Succeeded
      openshift-ovirt-infra                              node-healthcheck-operator.v0.10.0                 Node Health Check Operator          0.10.0                node-healthcheck-operator.v0.9.1                  Succeeded
      openshift-storage                                  cephcsi-operator.v4.18.11-rhodf                   CephCSI operator                    4.18.11-rhodf         cephcsi-operator.v4.18.10-rhodf                   Succeeded
      openshift-storage                                  mcg-operator.v4.18.11-rhodf                       NooBaa Operator                     4.18.11-rhodf         mcg-operator.v4.18.10-rhodf                       Succeeded
      openshift-storage                                  ocs-client-operator.v4.18.11-rhodf                OpenShift Data Foundation Client    4.18.11-rhodf         ocs-client-operator.v4.18.10-rhodf                Succeeded
      openshift-storage                                  ocs-operator.v4.18.11-rhodf                       OpenShift Container Storage         4.18.11-rhodf         ocs-operator.v4.18.10-rhodf                       Succeeded
      openshift-storage                                  odf-csi-addons-operator.v4.18.11-rhodf            CSI Addons                          4.18.11-rhodf         odf-csi-addons-operator.v4.18.10-rhodf            Succeeded
      openshift-storage                                  odf-dependencies.v4.18.11-rhodf                   Data Foundation Dependencies        4.18.11-rhodf         odf-dependencies.v4.18.10-rhodf                   Succeeded
      openshift-storage                                  odf-operator.v4.18.11-rhodf                       OpenShift Data Foundation           4.18.11-rhodf         odf-operator.v4.18.10-rhodf                       Succeeded
      openshift-storage                                  odf-prometheus-operator.v4.18.11-rhodf            Prometheus Operator                 4.18.11-rhodf         odf-prometheus-operator.v4.18.10-rhodf            Succeeded
      openshift-storage                                  recipe.v4.18.11-rhodf                             Recipe                              4.18.11-rhodf         recipe.v4.18.10-rhodf                             Succeeded
      openshift-storage                                  rook-ceph-operator.v4.18.11-rhodf                 Rook-Ceph                           4.18.11-rhodf         rook-ceph-operator.v4.18.10-rhodf                 Succeeded
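
      The listing above appears to be ClusterServiceVersion output; a similar snapshot can be regenerated with:

      # Columns: NAMESPACE, NAME, DISPLAY, VERSION, REPLACES, PHASE
      oc get csv -A | grep -E 'openshift-cnv|openshift-ovirt-infra|openshift-storage'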
      

      How reproducible:

      1. Introduced 60s of network latency for 15 minutes on 2 worker nodes in the same ODF zone (with one worker hosting a MON pod).
      2. The 2 worker nodes become NotReady.
      3. Some of the VMs on those worker nodes get stuck in the Scheduling state, due to the events below.
      4. I was able to reproduce the issue 2 out of 10 times.
      
      Events:
        Type     Reason           Age                   From     Message
        ----     ------           ----                  ----     -------
        Warning  FailedMapVolume  4m9s (x121 over 10h)  kubelet  MapVolume.SetUpDevice failed for volume "pvc-b7db50ea-0637-4653-aec7-05d40201b4d8" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 108) occurred while running rbd args: [--id csi-rbd-node -m 172.30.137.156:3300,172.30.245.95:3300,172.30.86.193:3300 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a7c2f767-3838-4efa-9cf3-b23e02ef619b --device-type krbd --options noudev --options read_from_replica=localize,crush_location=host:e09-h12-000-r660|rack:rack1], rbd error output: rbd: sysfs write failed
      rbd: map failed: (108) Cannot send after transport endpoint shutdown
        Warning  FailedMapVolume  3s (x187 over 10h)  kubelet  MapVolume.SetUpDevice failed for volume "pvc-b7db50ea-0637-4653-aec7-05d40201b4d8" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 108) occurred while running rbd args: [--id csi-rbd-node -m 172.30.137.156:3300,172.30.245.95:3300,172.30.86.193:3300 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a7c2f767-3838-4efa-9cf3-b23e02ef619b --device-type krbd --options noudev --options read_from_replica=localize,crush_location=rack:rack1|host:e09-h12-000-r660], rbd error output: rbd: sysfs write failed
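
      To dig further into the rbd map failure, the CSI RBD node-plugin logs and the kernel-side rbd/libceph errors on the affected node can be checked along these lines (pod label, container name and node name are assumptions based on a default ODF install and the crush_location shown in the events above):

      # CSI RBD node-plugin logs (label/container names may differ between ODF versions)
      oc -n openshift-storage logs -l app=csi-rbdplugin -c csi-rbdplugin --tail=100 | grep -i 'map failed'
      # Kernel messages from the node named in crush_location above
      oc debug node/e09-h12-000-r660 -- chroot /host dmesg | grep -iE 'rbd|libceph'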
      
      
      

      Steps to Reproduce:

      1. Introduce 60s of network latency for 15 minutes on 2 worker nodes in the same ODF zone (with one worker hosting a MON pod), using the krkn-hub network-chaos image:
      podman run --rm -e LABEL_SELECTOR="chaos=odf" -e INSTANCE_COUNT=1 -e DURATION=900 -e TRAFFIC_TYPE=egress  -e  EGRESS='{latency: 60000ms}' -e KUBECONFIG=/tmp/config  -e KRKN_KUBE_CONFIG=/tmp/config -e DISTRIBUTION='openshift' --net=host -v /tmp/config:/tmp/config:Z  quay.io/krkn-chaos/krkn-hub:network-chaos 
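
      After the chaos run, node readiness and VMI phases can be watched with something like the following (the chaos=odf label comes from the LABEL_SELECTOR above):

      # Watch the targeted worker nodes go NotReady and recover
      oc get nodes -l chaos=odf -w
      # Watch for VMIs that remain stuck in Scheduling
      oc get vmi -A -w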
      
      

      Actual results:

      Some of the VMs are stuck in the Scheduling state:
      virt-clone-clones   clone-vm-0-17    13h     Scheduling                                     False
      virt-clone-clones   clone-vm-0-296   13h     Scheduling                                     False
      virt-clone-clones   clone-vm-0-328   13h     Scheduling                                     False
      virt-clone-clones   clone-vm-0-40    13h     Scheduling                                     False
      
      virt-launcher-clone-vm-0-17-c9qhz    0/1     ContainerCreating   0          13h
      virt-launcher-clone-vm-0-296-dzg4b   0/1     ContainerCreating   0          13h
      virt-launcher-clone-vm-0-328-rls4j   0/1     ContainerCreating   0          13h
      virt-launcher-clone-vm-0-40-897zb    0/1     ContainerCreating   0          13h
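
      The FailedMapVolume events quoted in the Events section above can be pulled per launcher pod, for example:

      oc describe pod -n virt-clone-clones virt-launcher-clone-vm-0-17-c9qhz | grep -A2 FailedMapVolume
      oc get events -n virt-clone-clones --field-selector involvedObject.name=virt-launcher-clone-vm-0-17-c9qhz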
      
      
      

      Expected results:

      All the VMs (Fedora VMs) should be in the Running state.
      
      

      Additional info:

      
      

              akalenyu Alex Kalenyuk
              ysubrama@redhat.com Yogananth Subramanian
              Natalie Gavrielov Natalie Gavrielov
              Votes: 0
              Watchers: 8