-
Epic
-
Resolution: Unresolved
-
Blocker
-
None
-
Address SBR Design Gaps
-
False
-
-
False
-
To Do
-
RHWA-214 - SBR Operator
-
50% To Do, 50% In Progress, 0% Done
Summary
This epic tracks the architectural refactoring of Storage Based Remediation (SBR) to decouple detection logic from remediation execution and to resolve circular dependencies during node recovery. The current design creates race conditions with Node Health Check (NHC). It also prevents fenced nodes from verifying their health, because persistent taints block the storage workloads that this verification depends on.
The new architecture splits the remediation flow: healthy peers will report storage failures via Node Conditions instead of triggering fencing directly, which lets NHC arbitrate the fencing decision. In addition, the post-remediation workflow now removes the remediation resource (and its associated taint) immediately after fencing, and uses a grace period so that storage verification mechanisms can confirm node recovery without triggering recursive fencing loops.
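The split flow described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the SBR or NHC implementation: the condition name `StorageUnhealthy`, the `Node` class, and all function names are hypothetical, and the sketch assumes the arbiter is the component that enforces the grace period, which the epic does not specify.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Optional

# Hypothetical condition type; the epic does not name the real Node Condition.
STORAGE_CONDITION = "StorageUnhealthy"


@dataclass
class Node:
    """Toy stand-in for a cluster node (not a Kubernetes API type)."""
    name: str
    conditions: Dict[str, bool] = field(default_factory=dict)
    fenced_at: Optional[datetime] = None  # time of the last completed fencing
    tainted: bool = False


def report_storage_failure(node: Node) -> None:
    """Peer side: record the failure as a condition instead of fencing."""
    node.conditions[STORAGE_CONDITION] = True


def report_storage_recovered(node: Node) -> None:
    """Verification side: clear the condition once storage is confirmed healthy."""
    node.conditions[STORAGE_CONDITION] = False


def arbiter_should_remediate(node: Node, now: datetime, grace_period: timedelta) -> bool:
    """NHC-like arbiter: remediate only if the condition is set and the
    node is outside the grace period that follows its last fencing."""
    if not node.conditions.get(STORAGE_CONDITION, False):
        return False
    if node.fenced_at is not None and now - node.fenced_at < grace_period:
        # Assumption: a recently fenced node gets time to verify its storage.
        return False
    return True


def fence(node: Node, now: datetime) -> None:
    """Remediation: the taint exists only while the remediation resource does."""
    node.tainted = True    # remediation resource created, taint applied
    # The actual fencing action (for example, a reboot) would happen here.
    node.fenced_at = now
    node.tainted = False   # resource and taint removed right after fencing
```

In this model, a node fenced at time `t` is not remediated again before `t + grace_period` even if the condition is still set, which is the property that prevents a recursive fencing loop. Because the taint is gone as soon as fencing completes, storage workloads can be scheduled and can clear the condition during that window.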
Here is a link to the detailed design that will be implemented in this epic