Epic
Resolution: Unresolved
Major
Background
Team Medik8s provides a high-availability solution that automates the "healing" of Kubernetes clusters. The solution consists mainly of the NHC operator, which detects node health, and several remediator operators (SNR, FAR, MDR, and SBR), which fence an unhealthy node from the cluster and remediate it back to a healthy state.
Although each remediator works differently, FAR, SNR, and SBR (MDR is a unique use case) follow a similar flow, and all of them cordon the node at the beginning of remediation:
- FAR adds a custom medik8s.io/fence-agents-remediation:NoExecute taint (see RHWA-311 for why it is suggested to use the NoSchedule effect) on the node.
- SNR (and SBR) adds a different custom medik8s.io/remediation=self-node-remediation:NoExecute taint, and then sets the Node's spec.unschedulable field to true so that Kubernetes adds the node.kubernetes.io/unschedulable:NoSchedule taint.
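The current behavior described above can be summarized as data. The following is a minimal sketch, not the operators' actual code: the `Taint` struct is a simplified stand-in for `corev1.Taint`, and `currentCordonTaints` and the plain-string remediator names are hypothetical helpers introduced only for illustration. The taint keys, values, and effects are copied from the two bullets above.

```go
package main

import "fmt"

// Taint is a simplified stand-in for corev1.Taint (k8s.io/api/core/v1).
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// String renders the taint in kubectl notation: key[=value]:effect.
func (t Taint) String() string {
	if t.Value == "" {
		return fmt.Sprintf("%s:%s", t.Key, t.Effect)
	}
	return fmt.Sprintf("%s=%s:%s", t.Key, t.Value, t.Effect)
}

// currentCordonTaints returns the taints that end up on a node at the start
// of remediation today, as described in the Background section.
// The remediator names are plain strings used for illustration only.
func currentCordonTaints(remediator string) []Taint {
	switch remediator {
	case "FAR":
		return []Taint{
			{Key: "medik8s.io/fence-agents-remediation", Effect: "NoExecute"},
		}
	case "SNR", "SBR":
		return []Taint{
			{Key: "medik8s.io/remediation", Value: "self-node-remediation", Effect: "NoExecute"},
			// Added by Kubernetes itself, not by the operator, once the
			// node's spec.unschedulable field is set to true.
			{Key: "node.kubernetes.io/unschedulable", Effect: "NoSchedule"},
		}
	default:
		return nil
	}
}

func main() {
	for _, r := range []string{"FAR", "SNR"} {
		for _, t := range currentCordonTaints(r) {
			fmt.Printf("%s: %s\n", r, t)
		}
	}
}
```

Laying the taints out side by side makes the inconsistency visible: the two operators use different keys, only SNR sets a value, and only SNR ends up with a second, Kubernetes-managed taint.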
Solution
The two custom taints currently in use have inconsistent APIs (different keys) and the wrong effect. The proposed changes are:
- FAR should use the same taint key but change the effect (i.e., remediation.medik8s.io/fence-agents-remediation:NoSchedule)
- SNR should change the taint key and effect (i.e., remediation.medik8s.io/self-node-remediation:NoSchedule)
- SNR should stop marking the node as unschedulable by updating its spec (i.e., stop implicitly triggering the K8s unschedulable taint)
- Find any remaining references to the old taint keys/effects and update them as part of this epic
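The taint change proposed above can be sketched as a small pure function over a node's taint list. This is an illustrative assumption rather than the planned implementation: `Taint` is a simplified stand-in for `corev1.Taint`, and `replaceTaint`, `proposedFARTaint`, and `proposedSNRTaint` are hypothetical names. The taint strings are copied from the list above. The sketch does not touch node.kubernetes.io/unschedulable, since Kubernetes manages that taint based on spec.unschedulable.

```go
package main

import "fmt"

// Taint is a simplified stand-in for corev1.Taint (k8s.io/api/core/v1).
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// Proposed taints, copied from the Solution list above.
var (
	proposedFARTaint = Taint{Key: "remediation.medik8s.io/fence-agents-remediation", Effect: "NoSchedule"}
	proposedSNRTaint = Taint{Key: "remediation.medik8s.io/self-node-remediation", Effect: "NoSchedule"}
)

// replaceTaint drops every taint whose key is listed in oldKeys and makes
// sure replacement is present exactly once. The input slice is not modified.
func replaceTaint(taints []Taint, oldKeys []string, replacement Taint) []Taint {
	old := make(map[string]bool, len(oldKeys))
	for _, k := range oldKeys {
		old[k] = true
	}
	out := make([]Taint, 0, len(taints)+1)
	for _, t := range taints {
		// Skip old taints, and skip an existing copy of the replacement
		// so that it is appended only once below.
		if old[t.Key] || t == replacement {
			continue
		}
		out = append(out, t)
	}
	return append(out, replacement)
}

func main() {
	nodeTaints := []Taint{
		{Key: "medik8s.io/remediation", Value: "self-node-remediation", Effect: "NoExecute"},
		{Key: "example.com/unrelated", Effect: "NoSchedule"}, // hypothetical unrelated taint
	}
	updated := replaceTaint(nodeTaints, []string{"medik8s.io/remediation"}, proposedSNRTaint)
	for _, t := range updated {
		fmt.Printf("%s=%s:%s\n", t.Key, t.Value, t.Effect)
	}
	_ = proposedFARTaint // FAR would be handled the same way with its own old key
}
```

Keeping the helper idempotent means it is safe to call on every reconcile, which also helps when cleaning up nodes that still carry an old taint after an operator upgrade.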
For more details, see the design discussion at https://docs.google.com/document/d/12FMnb6RSs2iZFX7BldpY3BjlpBy0uSzv15hTxPb3iOg/edit?tab=t.0 and the two Slack threads at https://redhat-internal.slack.com/archives/C03M5GKJNBA/p1767710450933989 and https://redhat-internal.slack.com/archives/C03M5GKJNBA/p1765956370373569.