Bug
Resolution: Unresolved
Critical
odf-4.16
None
Description of problem (please be as detailed as possible and provide log
snippets):
Sequence 1:
While handling non-graceful node shutdown [1] in Rook, if a node is tainted with the out-of-service taint, Rook would:
- Initiate a fence operation for the node
- CSI addons would act on the fence and issue a blocklist request to Ceph (see the sketch after this list)
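As an illustration only, a minimal sketch of what Sequence 1 amounts to, using the standard kube out-of-service taint and the raw Ceph CLI; in an actual ODF/Rook deployment the blocklist is driven through csi-addons (e.g. a NetworkFence CR) rather than issued by hand, and the node name/address below are placeholders:

  # Taint that triggers Rook's non-graceful node shutdown handling
  kubectl taint nodes <failed-node> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

  # Net effect of the fence on the Ceph side: blocklist the failed node's client address
  # so it can no longer write to RBD images
  ceph osd blocklist add <failed-node-ip>
  ceph osd blocklist ls    # verify the entry is present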
Sequence 2:
While the above is in progress, kube will in parallel:
- Force delete pods on the node
- Issue a volume detach for any volumes attached on the node
- The volume detach in the above sequence is a no-op in Ceph-CSI, which hence results in kube garbage collecting the VolumeAttachment resource for the volume (see the sketch after this list)
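Purely for illustration, the kube side of the race can be observed with commands like the following (node name is a placeholder); the point is that the VolumeAttachment for the RWO volume is removed and a replacement pod is scheduled regardless of whether the Ceph-side fence has completed:

  # Pods on the tainted node are force deleted and rescheduled elsewhere
  kubectl get pods -A -o wide --field-selector spec.nodeName=<failed-node>

  # The VolumeAttachment for the RBD volume is garbage collected after the (no-op) detach
  kubectl get volumeattachments -o wide -w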
Overall, because these reconciliations are independent, the above two sequences run in parallel. As a result kube may race ahead of the actual Ceph fencing being completed, and map/mount the volume on a new node.
While Ceph-CSI relies on RBD watchers, which are cleared only when fencing is completed, the RBD developers have raised cases where a watcher is not necessarily placed for every consumer of an image.
So unless the watcher is a reliable guard, the above 2 sequences can cause an inadvertent map and mount of a volume on a new node, which may end up corrupting the volume contents IFF the out-of-service node continues to have write access to the volume (e.g. only kubelet crashed on the out-of-service node).
Either this needs to be ensured (i.e. that watchers are reliable in this scheme), or, in cases where this procedure is prescribed (e.g. stretched cluster HA or even regular ODF clusters), the procedure should potentially be amended to first complete fencing of the node and only then taint the node with the out-of-service taint; a sketch of this amended ordering follows below.
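A sketch of the amended ordering, again using placeholder names and the raw Ceph CLI for clarity (in practice the fence would normally go through Rook/csi-addons rather than being issued manually):

  # 1. Fence the failed node first and confirm the fence took effect
  ceph osd blocklist add <failed-node-ip>
  ceph osd blocklist ls                      # the node's address should be listed

  # 2. Optionally confirm no stale RBD client/watcher remains on the affected image(s)
  rbd status <pool>/<image>

  # 3. Only then apply the out-of-service taint so kube proceeds with detach/reattach
  kubectl taint nodes <failed-node> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute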
Version of all relevant components (if applicable): All versions since this scheme was introduced
Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
If the above can indeed cause corruption, normal pod eviction and rescheduling for RBD-backed RWO volumes would be at risk.
Is there any workaround available to the best of your knowledge?
As stated above: in cases where this procedure is prescribed (e.g. stretched cluster HA or even regular ODF clusters), it should potentially be amended to first complete fencing of the node and only then taint the node with the out-of-service taint.
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Untested; the potential problem was arrived at in discussion with Madhu and Niels.
Can this issue be reproduced?
We could scale down the CSI addons operator such that fencing does not complete, and then observe kube behavior and the related RBD watcher behavior (see the reproduction sketch below).
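A possible reproduction sketch, assuming an OpenShift/ODF cluster with the usual openshift-storage namespace (the csi-addons deployment name below is an assumption and may differ by version):

  # Prevent fencing from completing by scaling down the csi-addons operator
  kubectl -n openshift-storage scale deployment csi-addons-controller-manager --replicas=0

  # Simulate the non-graceful shutdown (e.g. stop kubelet on the node while leaving its
  # RBD access intact), then apply the out-of-service taint
  kubectl taint nodes <failed-node> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

  # Observe kube behavior: force-deleted pods, garbage-collected VolumeAttachments,
  # replacement pods scheduled on other nodes
  kubectl get volumeattachments -o wide
  kubectl get pods -A -o wide --field-selector spec.nodeName=<failed-node>

  # Observe RBD watcher behavior for the affected image(s) while fencing is incomplete
  rbd status <pool>/<image>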
Can this issue be reproduced from the UI? No
Additional info:
This BZ is opened to analyze the situation above and to provide corrective steps or take corrective actions in this regard.