OpenShift Virtualization / CNV-70055

virt-launcher pod fails to MapVolume after node outage


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Storage Platform
    • Quality / Stability / Reliability

      Description of problem:

      The virt-launcher pod fails to MapVolume after a node outage (induced by introducing 60 seconds of network latency for 15 minutes), which leaves the VMI stuck in the Scheduling state.
      I was able to reproduce the issue 2 out of 10 times.
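      A quick way to spot the affected VMIs and their pending virt-launcher pods (namespace and pod names taken from the Actual results section below; adjust as needed):

      # VMIs that are not yet Running, and their pending virt-launcher pods
      oc get vmi -A | grep Scheduling
      oc get pods -n virt-clone-clones | grep virt-launcher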
      

      Version-Release number of selected component (if applicable):

      openshift-cnv                                      kubevirt-hyperconverged-operator.v4.18.13         OpenShift Virtualization            4.18.13               kubevirt-hyperconverged-operator.v4.18.11         Succeeded
      openshift-ovirt-infra                              node-healthcheck-operator.v0.10.0                 Node Health Check Operator          0.10.0                node-healthcheck-operator.v0.9.1                  Succeeded
      openshift-storage                                  cephcsi-operator.v4.18.11-rhodf                   CephCSI operator                    4.18.11-rhodf         cephcsi-operator.v4.18.10-rhodf                   Succeeded
      openshift-storage                                  mcg-operator.v4.18.11-rhodf                       NooBaa Operator                     4.18.11-rhodf         mcg-operator.v4.18.10-rhodf                       Succeeded
      openshift-storage                                  ocs-client-operator.v4.18.11-rhodf                OpenShift Data Foundation Client    4.18.11-rhodf         ocs-client-operator.v4.18.10-rhodf                Succeeded
      openshift-storage                                  ocs-operator.v4.18.11-rhodf                       OpenShift Container Storage         4.18.11-rhodf         ocs-operator.v4.18.10-rhodf                       Succeeded
      openshift-storage                                  odf-csi-addons-operator.v4.18.11-rhodf            CSI Addons                          4.18.11-rhodf         odf-csi-addons-operator.v4.18.10-rhodf            Succeeded
      openshift-storage                                  odf-dependencies.v4.18.11-rhodf                   Data Foundation Dependencies        4.18.11-rhodf         odf-dependencies.v4.18.10-rhodf                   Succeeded
      openshift-storage                                  odf-operator.v4.18.11-rhodf                       OpenShift Data Foundation           4.18.11-rhodf         odf-operator.v4.18.10-rhodf                       Succeeded
      openshift-storage                                  odf-prometheus-operator.v4.18.11-rhodf            Prometheus Operator                 4.18.11-rhodf         odf-prometheus-operator.v4.18.10-rhodf            Succeeded
      openshift-storage                                  recipe.v4.18.11-rhodf                             Recipe                              4.18.11-rhodf         recipe.v4.18.10-rhodf                             Succeeded
      openshift-storage                                  rook-ceph-operator.v4.18.11-rhodf                 Rook-Ceph                           4.18.11-rhodf         rook-ceph-operator.v4.18.10-rhodf                 Succeeded
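
      The listing above appears to be ClusterServiceVersion output; a similar snapshot can be regenerated with:

      # Columns: NAMESPACE, NAME, DISPLAY, VERSION, REPLACES, PHASE
      oc get csv -A | grep -E 'openshift-cnv|openshift-ovirt-infra|openshift-storage'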
      

      How reproducible:

      1. Introduced 60s of network latency for 15 minutes on 2 worker nodes in the same ODF zone (with one worker hosting a MON pod).
      2. The 2 worker nodes become NotReady.
      3. Some of the VMs on those worker nodes get stuck in the Scheduling state, due to the events below.
      4. I was able to reproduce the issue 2 out of 10 times.
      
      Events:
        Type     Reason           Age                   From     Message
        ----     ------           ----                  ----     -------
        Warning  FailedMapVolume  4m9s (x121 over 10h)  kubelet  MapVolume.SetUpDevice failed for volume "pvc-b7db50ea-0637-4653-aec7-05d40201b4d8" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 108) occurred while running rbd args: [--id csi-rbd-node -m 172.30.137.156:3300,172.30.245.95:3300,172.30.86.193:3300 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a7c2f767-3838-4efa-9cf3-b23e02ef619b --device-type krbd --options noudev --options read_from_replica=localize,crush_location=host:e09-h12-000-r660|rack:rack1], rbd error output: rbd: sysfs write failed
      rbd: map failed: (108) Cannot send after transport endpoint shutdown
        Warning  FailedMapVolume  3s (x187 over 10h)  kubelet  MapVolume.SetUpDevice failed for volume "pvc-b7db50ea-0637-4653-aec7-05d40201b4d8" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 108) occurred while running rbd args: [--id csi-rbd-node -m 172.30.137.156:3300,172.30.245.95:3300,172.30.86.193:3300 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-a7c2f767-3838-4efa-9cf3-b23e02ef619b --device-type krbd --options noudev --options read_from_replica=localize,crush_location=rack:rack1|host:e09-h12-000-r660], rbd error output: rbd: sysfs write failed
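
      To dig further into the rbd map failure, the CSI RBD node-plugin logs and the kernel-side rbd/libceph errors on the affected node can be checked along these lines (pod label, container name and node name are assumptions based on a default ODF install and the crush_location shown in the events above):

      # CSI RBD node-plugin logs (label/container names may differ between ODF versions)
      oc -n openshift-storage logs -l app=csi-rbdplugin -c csi-rbdplugin --tail=100 | grep -i 'map failed'
      # Kernel messages from the node named in crush_location above
      oc debug node/e09-h12-000-r660 -- chroot /host dmesg | grep -iE 'rbd|libceph'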
      
      
      

      Steps to Reproduce:

      1. Introduce 60s of network latency for 15 minutes on 2 worker nodes in the same ODF zone (with one worker hosting a MON pod), using the krkn-hub network-chaos image:
      podman run --rm -e LABEL_SELECTOR="chaos=odf" -e INSTANCE_COUNT=1 -e DURATION=900 -e TRAFFIC_TYPE=egress  -e  EGRESS='{latency: 60000ms}' -e KUBECONFIG=/tmp/config  -e KRKN_KUBE_CONFIG=/tmp/config -e DISTRIBUTION='openshift' --net=host -v /tmp/config:/tmp/config:Z  quay.io/krkn-chaos/krkn-hub:network-chaos 
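
      After the chaos run, node readiness and VMI phases can be watched with something like the following (the chaos=odf label comes from the LABEL_SELECTOR above):

      # Watch the targeted worker nodes go NotReady and recover
      oc get nodes -l chaos=odf -w
      # Watch for VMIs that remain stuck in Scheduling
      oc get vmi -A -w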
      
      

      Actual results:

      Some of the VMs are stuck in the Scheduling state:
      virt-clone-clones   clone-vm-0-17    13h     Scheduling                                     False
      virt-clone-clones   clone-vm-0-296   13h     Scheduling                                     False
      virt-clone-clones   clone-vm-0-328   13h     Scheduling                                     False
      virt-clone-clones   clone-vm-0-40    13h     Scheduling                                     False
      
      virt-launcher-clone-vm-0-17-c9qhz    0/1     ContainerCreating   0          13h
      virt-launcher-clone-vm-0-296-dzg4b   0/1     ContainerCreating   0          13h
      virt-launcher-clone-vm-0-328-rls4j   0/1     ContainerCreating   0          13h
      virt-launcher-clone-vm-0-40-897zb    0/1     ContainerCreating   0          13h
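
      The FailedMapVolume events quoted in the Events section above can be pulled per launcher pod, for example:

      oc describe pod -n virt-clone-clones virt-launcher-clone-vm-0-17-c9qhz | grep -A2 FailedMapVolume
      oc get events -n virt-clone-clones --field-selector involvedObject.name=virt-launcher-clone-vm-0-17-c9qhz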
      
      
      

      Expected results:

      All the VMs (Fedora VMs) should be in the Running state.
      
      

      Additional info:

      
      

              akalenyu Alex Kalenyuk
              ysubrama@redhat.com Yogananth Subramanian
              Natalie Gavrielov Natalie Gavrielov
              Votes: 0
              Watchers: 8