Bug
Resolution: Unresolved
Critical
odf-4.16
None
Description of problem (please be as detailed as possible and provide log
snippets):
Sequence 1:
While handling non-graceful node shutdown [1] in Rook, if a node is tainted with the out-of-service taint, Rook would:
- Initiate a fence operation for the node
- CSI addons would act on the fence and issue a blocklist request to Ceph (see the sketch after this list)
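As an illustration only, a minimal sketch of what Sequence 1 amounts to, using the standard kube out-of-service taint and the raw Ceph CLI; in an actual ODF/Rook deployment the blocklist is driven through csi-addons (e.g. a NetworkFence CR) rather than issued by hand, and the node name/address below are placeholders:

  # Taint that triggers Rook's non-graceful node shutdown handling
  kubectl taint nodes <failed-node> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

  # Net effect of the fence on the Ceph side: blocklist the failed node's client address
  # so it can no longer write to RBD images
  ceph osd blocklist add <failed-node-ip>
  ceph osd blocklist ls    # verify the entry is present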
Sequence 2:
While the above is in progress, kube will in parallel:
- Force delete pods on the node
- Issue a volume detach for any volumes attached on the node
- The volume detach in the above sequence is a no-op in Ceph-CSI, which hence results in kube garbage collecting the VolumeAttachment resource for the volume (see the sketch after this list)
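Purely for illustration, the kube side of the race can be observed with commands like the following (node name is a placeholder); the point is that the VolumeAttachment for the RWO volume is removed and a replacement pod is scheduled regardless of whether the Ceph-side fence has completed:

  # Pods on the tainted node are force deleted and rescheduled elsewhere
  kubectl get pods -A -o wide --field-selector spec.nodeName=<failed-node>

  # The VolumeAttachment for the RBD volume is garbage collected after the (no-op) detach
  kubectl get volumeattachments -o wide -w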
Overall, because these reconciliations are independent, the above two sequences run in parallel. As a result kube may race ahead of the actual Ceph fencing being completed, and map/mount the volume on a new node.
While Ceph-CSI relies on RBD watchers, which are cleared only when fencing is completed, the RBD developers have raised cases where a watcher is not necessarily placed for every consumer of an image.
So unless the watcher is a reliable guard, the above 2 sequences can cause an inadvertent map and mount of a volume on a new node, which may end up corrupting the volume contents IFF the out-of-service node continues to have write access to the volume (e.g. only kubelet crashed on the out-of-service node).
Either this needs to be ensured (i.e. that watchers are reliable in this scheme), or, in cases where this procedure is prescribed (e.g. stretched cluster HA or even regular ODF clusters), the procedure should potentially be amended to first complete fencing of the node and only then taint the node with the out-of-service taint; a sketch of this amended ordering follows below.
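A sketch of the amended ordering, again using placeholder names and the raw Ceph CLI for clarity (in practice the fence would normally go through Rook/csi-addons rather than being issued manually):

  # 1. Fence the failed node first and confirm the fence took effect
  ceph osd blocklist add <failed-node-ip>
  ceph osd blocklist ls                      # the node's address should be listed

  # 2. Optionally confirm no stale RBD client/watcher remains on the affected image(s)
  rbd status <pool>/<image>

  # 3. Only then apply the out-of-service taint so kube proceeds with detach/reattach
  kubectl taint nodes <failed-node> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute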
Version of all relevant components (if applicable): All versions since this scheme was introduced
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
If the above can indeed cause corruption, normal pod eviction and rescheduling for RBD-backed RWO volumes would be at risk.
Is there any workaround available to the best of your knowledge?
As stated above: in cases where this procedure is prescribed (e.g. stretched cluster HA or even regular ODF clusters), it should potentially be amended to first complete fencing of the node and only then taint the node with the out-of-service taint.
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Untested; the potential problem was arrived at in discussion with Madhu and Niels.
Can this issue be reproduced?
We could scale down the CSI addons operator such that fencing does not complete, and then observe kube behavior and the related RBD watcher behavior (see the reproduction sketch below).
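A possible reproduction sketch, assuming an OpenShift/ODF cluster with the usual openshift-storage namespace (the csi-addons deployment name below is an assumption and may differ by version):

  # Prevent fencing from completing by scaling down the csi-addons operator
  kubectl -n openshift-storage scale deployment csi-addons-controller-manager --replicas=0

  # Simulate the non-graceful shutdown (e.g. stop kubelet on the node while leaving its
  # RBD access intact), then apply the out-of-service taint
  kubectl taint nodes <failed-node> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

  # Observe kube behavior: force-deleted pods, garbage-collected VolumeAttachments,
  # replacement pods scheduled on other nodes
  kubectl get volumeattachments -o wide
  kubectl get pods -A -o wide --field-selector spec.nodeName=<failed-node>

  # Observe RBD watcher behavior for the affected image(s) while fencing is incomplete
  rbd status <pool>/<image>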
Can this issue be reproduced from the UI? No
Additional info:
This BZ is opened to analyze the situation above and to provide corrective steps or take corrective actions in this regard.