Data Foundation Bugs · DFBUGS-978

[Clone to 4.16][2315666] [Stretch cluster] Network Fence for non-graceful node shutdown taint blocked volume mount on surviving zone


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Version: odf-4.16
    • Component: rook

      Description of problem (please be as detailed as possible and provide log
      snippets):

      Testing disaster recovery with a stretch cluster for OpenShift Data Foundation, as defined here: https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.16/html/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/introduction-to-stretch-cluster-disaster-recovery_stretch-cluster

      Installed a product with an HA configuration and simulated a disaster by shutting down ALL nodes in Zone-2. After the shutdown, applied the non-graceful node shutdown taint as defined here: https://kubernetes.io/blog/2023/08/16/kubernetes-1-28-non-graceful-node-shutdown-ga/ which triggered creation of a NetworkFence. The NetworkFence blocked recovery of applications in Zone-1, the surviving zone.

      Network fence:
      ---------------
      NAME   DRIVER                               CIDRS               FENCESTATE ...
      xxxx   openshift-storage.rbd.csi.ceph.com   ["100.64.0.9/32"]   Fenced     ...
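The fence state above can be inspected through the NetworkFence custom resource managed by the CSI-Addons controller. A minimal sketch, assuming a default ODF install (the CR name `<networkfence-name>` is a placeholder, not a value from this report):

```shell
# List NetworkFence CRs (cluster-scoped csiaddons.openshift.io resource)
# to see the DRIVER / CIDRS / FENCESTATE columns shown above.
oc get networkfences.csiaddons.openshift.io

# Inspect the fenced CIDR and fence state for one CR;
# <networkfence-name> is a placeholder for the actual CR name.
oc describe networkfences.csiaddons.openshift.io <networkfence-name>
```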

      Datastore pods on surviving zone had issues mounting PVCs (read-only file system):

      Events:
      Type Reason Age From Message
      ---- ------ ---- ---- -------
      Warning Failed 6m9s (x4 over 7m21s) kubelet Error: relabel failed /var/lib/kubelet/pods/8820f331-64bc-47d1-9140-05d963f43634/volumes/kubernetes.io~csi/pvc-f892de38-e7c8-418f-9d64-cc1e958ca209/mount: lsetxattr /var/lib/kubelet/pods/8820f331-64bc-47d1-9140-05d963f43634/volumes/kubernetes.io~csi/pvc-f892de38-e7c8-418f-9d64-cc1e958ca209/mount: read-only file system
      Warning Failed 4m52s kubelet Error: relabel failed /var/lib/kubelet/pods/8820f331-64bc-47d1-9140-05d963f43634/volumes/kubernetes.io~csi/pvc-f892de38-e7c8-418f-9d64-cc1e958ca209/mount: lsetxattr /var/lib/kubelet/pods/8820f331-64bc-47d1-9140-05d963f43634/volumes/kubernetes.io~csi/pvc-f892de38-e7c8-418f-9d64-cc1e958ca209/mount/conf: read-only file system

      Deleted a datastore pod on the surviving zone; when it was recreated, saw the following:

      Events:
      Type Reason Age From Message
      ---- ------ ---- ---- -------
      ...
      Warning FailedMount 21s (x9 over 2m29s) kubelet MountVolume.MountDevice failed for volume "pvc-5438fe65-a911-4226-9d1f-d3402af17cc9" : rpc error: code = Internal desc = error generating volume 0001-0011-openshift-storage-0000000000000001-71f3c614-f564-4814-8cbb-4a0247804fb7: rados: ret=-108, Cannot send after transport endpoint shutdown

      Version of all relevant components (if applicable):
      OCP 4.16.10
      ODF Client : 4.17.0-98.stable
      ODF Foundation : 4.17.0-98.stable

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what the user impact is)? Yes - unable to recover in a disaster scenario.

      Is there any workaround available to the best of your knowledge? no

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)? 3 (due to custom product install steps)

      Is this issue reproducible? Likely.

      Can this issue be reproduced from the UI? No.

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1. Establish cluster in Stretch configuration
      2. Install the product with an HA configuration, with datastores using RWO PVCs backed by ocs-storagecluster-ceph-rbd storage
      3. Simulate Zone outage for Zone-2 by shutting down all nodes
      4. Add taint for non-graceful shutdown to Zone-2 nodes
      `kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute`
      5. Monitor kube resources on surviving zone - all should be able to access storage
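Steps 4 and 5 can be sketched as follows, assuming the Zone-2 nodes carry the standard topology zone label with value `zone-2` (the label value is an assumption based on this report's naming; substitute your actual zone label):

```shell
# Step 4: taint every Zone-2 node as out-of-service after shutdown.
# The zone label value "zone-2" is an assumption; adjust to your cluster.
for node in $(kubectl get nodes -l topology.kubernetes.io/zone=zone-2 -o name); do
  kubectl taint "$node" node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
done

# Step 5: watch workloads on the surviving zone; pods stuck in
# ContainerCreating with FailedMount events indicate the bug.
kubectl get pods -A -o wide --watch
```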

      Actual results:
      Pods on the surviving zone are unable to mount some PVCs:

      Events:
      Type Reason Age From Message
      ---- ------ ---- ---- -------
      ...
      Warning FailedMount 21s (x9 over 2m29s) kubelet MountVolume.MountDevice failed for volume "pvc-5438fe65-a911-4226-9d1f-d3402af17cc9" : rpc error: code = Internal desc = error generating volume 0001-0011-openshift-storage-0000000000000001-71f3c614-f564-4814-8cbb-4a0247804fb7: rados: ret=-108, Cannot send after transport endpoint shutdown

      Expected results:
      Pods on surviving zone are able to mount PVCs

      Additional info:

      When Zone-2 was restored and the non-graceful shutdown taint was removed, the mount error was resolved and all pods ran successfully on the surviving zone.
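For reference, the recovery step that cleared the error amounts to removing the out-of-service taint once the Zone-2 nodes are back (a trailing `-` on a taint key removes it; the zone label value is again an assumption):

```shell
# Remove the out-of-service taint from each restored Zone-2 node.
# The zone label value "zone-2" is an assumption; adjust to your cluster.
for node in $(kubectl get nodes -l topology.kubernetes.io/zone=zone-2 -o name); do
  kubectl taint "$node" node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
done

# Confirm the NetworkFence CR has been cleaned up and mounts succeed again.
oc get networkfences.csiaddons.openshift.io
```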

              skrai Subham Rai
              morstad Nancy Heinz
              Santosh Pillai
              Votes: 0
              Watchers: 27