Bug
Resolution: Unresolved
Consider a use case where escalation remediation is configured with FAR/SNR as the first remediator and a user-based (custom) remediator as the second, and the following has occurred:
- On node failure, NHC creates a FAR/SNR remediation
- FAR/SNR fails to remediate in the allocated time
- NHC times out the first remediator (FAR/SNR): SNR removes the finalizer but not the taints it added, or FAR just stops the running agent
- NHC triggers a custom remediation
- The second remediator attempts to remediate the node and fails in the process
- NHC times out the second remediator
FAR/SNR added a remediation taint (e.g., medik8s.io/fence-agents-remediation or medik8s.io/remediation) and, with the OutOfServiceTaint remediationStrategy, an out-of-service taint, but these taints were never cleaned up.
Leaving these taints in place after the failed remediation attempts degrades the user experience (the taints use the NoExecute effect, so the node has limited pod availability) and can make it harder to continue troubleshooting the failed node.
NHC could handle these known taints on a failed escalation remediation at one of two points:
- After each remediation: if two remediators both access (add/remove) taints, this could lead to race conditions unless NHC pauses remediation for some time to reduce that chance.
- After the last remediator times out: NHC goes over the known and possible taints (and other resources, if any) and deletes them, then emits a new event ("Failed Remediation") and adds an annotation to the unhealthy node (with timestamp/attempts). The event could easily be tracked by Prometheus or by a watch on events, and the annotation would "signal" the earlier failed NHC attempt.
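The second option could be sketched roughly as below. This is only an illustration, not NHC code: the `Taint` struct is a local stand-in for the Key/Value/Effect fields of a Kubernetes node taint, the `node.kubernetes.io/out-of-service` key is assumed to be the out-of-service taint meant above, and the annotation key `example.com/failed-remediation` and its value format are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Taint is a local stand-in for a node taint (Key/Value/Effect), so the
// sketch compiles without Kubernetes dependencies.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// knownRemediationTaints holds the taint keys named in this issue. The
// out-of-service key is an assumption about which taint is meant.
var knownRemediationTaints = map[string]bool{
	"medik8s.io/fence-agents-remediation": true,
	"medik8s.io/remediation":              true,
	"node.kubernetes.io/out-of-service":   true,
}

// failedRemediationAnnotation is a hypothetical annotation key.
const failedRemediationAnnotation = "example.com/failed-remediation"

// removeKnownTaints splits a node's taints into those that stay and those
// left behind by a failed remediation that should be deleted.
func removeKnownTaints(taints []Taint) (kept, removed []Taint) {
	for _, t := range taints {
		if knownRemediationTaints[t.Key] {
			removed = append(removed, t)
		} else {
			kept = append(kept, t)
		}
	}
	return kept, removed
}

// failedRemediationValue encodes the timestamp and attempt count
// (hypothetical format) for the annotation on the unhealthy node.
func failedRemediationValue(at time.Time, attempts int) string {
	return fmt.Sprintf("%s/attempts=%d", at.UTC().Format(time.RFC3339), attempts)
}

func main() {
	taints := []Taint{
		{Key: "medik8s.io/remediation", Effect: "NoExecute"},
		{Key: "node.kubernetes.io/out-of-service", Value: "nodeshutdown", Effect: "NoExecute"},
		{Key: "example.com/dedicated", Value: "gpu", Effect: "NoSchedule"},
	}
	kept, removed := removeKnownTaints(taints)
	fmt.Println("kept:", len(kept), "removed:", len(removed))
	fmt.Println(failedRemediationAnnotation + "=" + failedRemediationValue(time.Now(), 2))
}
```

In a real controller the filtered taint list would be written back to the node, and the event would be emitted in the same reconcile step. Taints that do not belong to a known remediator are left untouched.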
If the second remediator is again FAR/SNR, it seems acceptable to keep the taints as an indication that the node is still unhealthy. However, the taints' effect remains in place, which the user might wish to avoid after a failed attempt by NHC.
Related Slack discussion - https://redhat-internal.slack.com/archives/C03M5GKJNBA/p1736162948282599
- is triggered by RHWA-277 SNR CR isn't removed when node is deleted by MDR (Closed)