Bug
Resolution: Unresolved
Major
None
None
None
Description of problem (please be as detailed as possible and provide log
snippets):
Testing disaster recovery with stretch cluster for OpenShift Data Foundation as defined here: https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.16/html/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/introduction-to-stretch-cluster-disaster-recovery_stretch-cluster
Installed a product with an HA configuration and simulated a disaster by shutting down ALL nodes for Zone-2. After the shutdown, applied the non-graceful node shutdown taint as defined here: https://kubernetes.io/blog/2023/08/16/kubernetes-1-28-non-graceful-node-shutdown-ga/ which triggered creation of a NetworkFence. The NetworkFence impacted recovery of applications on Zone-1.
Network fence :
---------------
NAME DRIVER CIDRS FENCESTATE ...
xxxx openshift-storage.rbd.csi.ceph.com ["100.64.0.9/32"] Fenced ...
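For context, the Fenced state above means the listed CIDRs are blocklisted on the Ceph cluster, cutting off RBD I/O from any client at those addresses. A minimal sketch (the helper name is hypothetical, not part of any ODF API) of checking whether a node address falls inside the fenced CIDRs:

```python
import ipaddress

def is_fenced(address: str, fence_cidrs: list[str]) -> bool:
    """Return True if the address falls inside any fenced CIDR."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(cidr) for cidr in fence_cidrs)

# The CIDR list mirrors the NetworkFence CR above; 100.64.0.9 is the
# client address that ends up blocked from the Ceph cluster.
print(is_fenced("100.64.0.9", ["100.64.0.9/32"]))   # True
print(is_fenced("100.64.0.10", ["100.64.0.9/32"]))  # False
```

Note that a /32 entry fences exactly one client address, which is why the surviving zone being affected at all is unexpected here.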
Datastore pods on surviving zone had issues mounting PVCs (read-only file system):
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Failed 6m9s (x4 over 7m21s) kubelet Error: relabel failed /var/lib/kubelet/pods/8820f331-64bc-47d1-9140-05d963f43634/volumes/kubernetes.io~csi/pvc-f892de38-e7c8-418f-9d64-cc1e958ca209/mount: lsetxattr /var/lib/kubelet/pods/8820f331-64bc-47d1-9140-05d963f43634/volumes/kubernetes.io~csi/pvc-f892de38-e7c8-418f-9d64-cc1e958ca209/mount: read-only file system
Warning Failed 4m52s kubelet Error: relabel failed /var/lib/kubelet/pods/8820f331-64bc-47d1-9140-05d963f43634/volumes/kubernetes.io~csi/pvc-f892de38-e7c8-418f-9d64-cc1e958ca209/mount: lsetxattr /var/lib/kubelet/pods/8820f331-64bc-47d1-9140-05d963f43634/volumes/kubernetes.io~csi/pvc-f892de38-e7c8-418f-9d64-cc1e958ca209/mount/conf: read-only file system
Deleted the datastore pod on the surviving zone; when it was recreated, saw the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
...
Warning FailedMount 21s (x9 over 2m29s) kubelet MountVolume.MountDevice failed for volume "pvc-5438fe65-a911-4226-9d1f-d3402af17cc9" : rpc error: code = Internal desc = error generating volume 0001-0011-openshift-storage-0000000000000001-71f3c614-f564-4814-8cbb-4a0247804fb7: rados: ret=-108, Cannot send after transport endpoint shutdown
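The `rados: ret=-108` in the FailedMount event is a negated errno: 108 is `ESHUTDOWN` on Linux, whose message is exactly the "Cannot send after transport endpoint shutdown" text in the event, i.e. the Ceph client connection was shut down (consistent with the client being fenced). Decoding it:

```python
import errno
import os

# librados returns negative errno values; -108 is -ESHUTDOWN on Linux.
ret = -108
err = -ret
print(err == errno.ESHUTDOWN)  # True (on Linux)
print(os.strerror(err))        # the message seen in the event, on Linux
```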
Version of all relevant components (if applicable):
OCP 4.16.10
ODF Client : 4.17.0-98.stable
ODF Foundation : 4.17.0-98.stable
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)? Yes - unable to recover in a disaster scenario.
Is there any workaround available to the best of your knowledge? no
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)? 3 (due to custom product install steps)
Is this issue reproducible? Likely.
Can this issue be reproduced from the UI? No.
If this is a regression, please provide more details to justify this:
Steps to Reproduce:
1. Establish a cluster in the stretch configuration
2. Install the product with an HA configuration, with datastores using RWO access to the ocs-storagecluster-ceph-rbd storage class
3. Simulate a zone outage for Zone-2 by shutting down all of its nodes
4. Add the non-graceful shutdown taint to the Zone-2 nodes:
`kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute`
5. Monitor kube resources on the surviving zone - all should be able to access storage
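Step 4 has to be applied to every node in the failed zone. A small sketch of building those taint commands (the node names are illustrative; in practice they would come from listing nodes by their `topology.kubernetes.io/zone` label):

```python
# The taint used in step 4, verbatim.
TAINT = "node.kubernetes.io/out-of-service=nodeshutdown:NoExecute"

def taint_commands(nodes: list[str]) -> list[list[str]]:
    """Build one `kubectl taint` argv per node in the failed zone."""
    return [["kubectl", "taint", "nodes", node, TAINT] for node in nodes]

# Hypothetical Zone-2 node names, for illustration only.
for cmd in taint_commands(["zone2-worker-0", "zone2-worker-1"]):
    print(" ".join(cmd))
```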
Actual results:
Pods on the surviving zone were unable to mount certain PVCs:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
...
Warning FailedMount 21s (x9 over 2m29s) kubelet MountVolume.MountDevice failed for volume "pvc-5438fe65-a911-4226-9d1f-d3402af17cc9" : rpc error: code = Internal desc = error generating volume 0001-0011-openshift-storage-0000000000000001-71f3c614-f564-4814-8cbb-4a0247804fb7: rados: ret=-108, Cannot send after transport endpoint shutdown
Expected results:
Pods on surviving zone are able to mount PVCs
Additional info:
When Zone-2 was restored and the non-graceful taint removed, the mount error was resolved and all pods ran successfully on the surviving zone.
Clones: DFBUGS-979 [Clone to 4.15][2315666] [Stretch cluster] Network Fence for non-graceful node shutdown taint blocked volume mount on surviving zone
Status: POST