Bug
Resolution: Unresolved
Major
None
None
None
Description of problem (please be as detailed as possible and provide log
snippets):
Testing disaster recovery with stretch cluster for OpenShift Data Foundation as defined here: https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.16/html/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/introduction-to-stretch-cluster-disaster-recovery_stretch-cluster
Installed a product with an HA configuration and simulated a disaster by shutting down ALL nodes for Zone-2. After the shutdown, applied the non-graceful node shutdown taint as defined here: https://kubernetes.io/blog/2023/08/16/kubernetes-1-28-non-graceful-node-shutdown-ga/ which triggered creation of a NetworkFence. The NetworkFence impacted recovery of applications on Zone-1.
Network fence :
---------------
NAME DRIVER CIDRS FENCESTATE ...
xxxx openshift-storage.rbd.csi.ceph.com ["100.64.0.9/32"] Fenced ...
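For context, the Fenced state above means the listed CIDRs are blocklisted on the Ceph cluster, cutting off RBD I/O from any client at those addresses. A minimal sketch (the helper name is hypothetical, not part of any ODF API) of checking whether a node address falls inside the fenced CIDRs:

```python
import ipaddress

def is_fenced(address: str, fence_cidrs: list[str]) -> bool:
    """Return True if the address falls inside any fenced CIDR."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(cidr) for cidr in fence_cidrs)

# The CIDR list mirrors the NetworkFence CR above; 100.64.0.9 is the
# client address that ends up blocked from the Ceph cluster.
print(is_fenced("100.64.0.9", ["100.64.0.9/32"]))   # True
print(is_fenced("100.64.0.10", ["100.64.0.9/32"]))  # False
```

Note that a /32 entry fences exactly one client address, which is why the surviving zone being affected at all is unexpected here.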
Datastore pods on surviving zone had issues mounting PVCs (read-only file system):
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Failed 6m9s (x4 over 7m21s) kubelet Error: relabel failed /var/lib/kubelet/pods/8820f331-64bc-47d1-9140-05d963f43634/volumes/kubernetes.io~csi/pvc-f892de38-e7c8-418f-9d64-cc1e958ca209/mount: lsetxattr /var/lib/kubelet/pods/8820f331-64bc-47d1-9140-05d963f43634/volumes/kubernetes.io~csi/pvc-f892de38-e7c8-418f-9d64-cc1e958ca209/mount: read-only file system
Warning Failed 4m52s kubelet Error: relabel failed /var/lib/kubelet/pods/8820f331-64bc-47d1-9140-05d963f43634/volumes/kubernetes.io~csi/pvc-f892de38-e7c8-418f-9d64-cc1e958ca209/mount: lsetxattr /var/lib/kubelet/pods/8820f331-64bc-47d1-9140-05d963f43634/volumes/kubernetes.io~csi/pvc-f892de38-e7c8-418f-9d64-cc1e958ca209/mount/conf: read-only file system
Deleted the datastore pod on the surviving zone; when it was recreated, saw the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
...
Warning FailedMount 21s (x9 over 2m29s) kubelet MountVolume.MountDevice failed for volume "pvc-5438fe65-a911-4226-9d1f-d3402af17cc9" : rpc error: code = Internal desc = error generating volume 0001-0011-openshift-storage-0000000000000001-71f3c614-f564-4814-8cbb-4a0247804fb7: rados: ret=-108, Cannot send after transport endpoint shutdown
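The `rados: ret=-108` in the FailedMount event is a negated errno: 108 is `ESHUTDOWN` on Linux, whose message is exactly the "Cannot send after transport endpoint shutdown" text in the event, i.e. the Ceph client connection was shut down (consistent with the client being fenced). Decoding it:

```python
import errno
import os

# librados returns negative errno values; -108 is -ESHUTDOWN on Linux.
ret = -108
err = -ret
print(err == errno.ESHUTDOWN)  # True (on Linux)
print(os.strerror(err))        # the message seen in the event, on Linux
```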
Version of all relevant components (if applicable):
OCP 4.16.10
ODF Client : 4.17.0-98.stable
ODF Foundation : 4.17.0-98.stable
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)? Yes - unable to recover in a disaster scenario.
Is there any workaround available to the best of your knowledge? no
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 3 (due to custom product install steps)
Is this issue reproducible? Likely.
Can this issue be reproduced from the UI? No.
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Establish a cluster in the stretch configuration
2. Install the product with an HA configuration, with datastores using RWO access to the ocs-storagecluster-ceph-rbd storage class
3. Simulate a zone outage for Zone-2 by shutting down all of its nodes
4. Add the non-graceful shutdown taint to the Zone-2 nodes:
`kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute`
5. Monitor kube resources on the surviving zone - all should be able to access storage
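Step 4 has to be applied to every node in the failed zone. A small sketch of building those taint commands (the node names are illustrative; in practice they would come from listing nodes by their `topology.kubernetes.io/zone` label):

```python
# The taint used in step 4, verbatim.
TAINT = "node.kubernetes.io/out-of-service=nodeshutdown:NoExecute"

def taint_commands(nodes: list[str]) -> list[list[str]]:
    """Build one `kubectl taint` argv per node in the failed zone."""
    return [["kubectl", "taint", "nodes", node, TAINT] for node in nodes]

# Hypothetical Zone-2 node names, for illustration only.
for cmd in taint_commands(["zone2-worker-0", "zone2-worker-1"]):
    print(" ".join(cmd))
```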
Actual results:
Pods on the surviving zone were unable to mount certain PVCs:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
...
Warning FailedMount 21s (x9 over 2m29s) kubelet MountVolume.MountDevice failed for volume "pvc-5438fe65-a911-4226-9d1f-d3402af17cc9" : rpc error: code = Internal desc = error generating volume 0001-0011-openshift-storage-0000000000000001-71f3c614-f564-4814-8cbb-4a0247804fb7: rados: ret=-108, Cannot send after transport endpoint shutdown
Expected results:
Pods on surviving zone are able to mount PVCs
Additional info:
When Zone-2 was restored and the non-graceful taint removed, the mount error was resolved and all pods ran successfully on the surviving zone.
Clones: DFBUGS-979 [Clone to 4.15][2315666] [Stretch cluster] Network Fence for non-graceful node shutdown taint blocked volume mount on surviving zone
Status: POST