
    • Story
    • Resolution: Unresolved
    • Undefined
    • None
    • None
    • Self Node Remediation
    • None
    • False
    • False

      Self Node Remediation (SNR) can take a long time to remediate a node when the node that becomes unhealthy is the one running the self-node-remediation-controller-manager pod.

      This delay occurs as of the following commit: https://github.com/medik8s/self-node-remediation/commit/748e5742284dd672d91a553a205496189288b592

      This patch changes the behavior as follows:

      [Before merging this patch]
      The SNR daemonset pod on every node tries to add/delete the out-of-service taint when an SNR object is created.

      [After merging this patch]
      Only the self-node-remediation-controller-manager pod and the SNR daemonset pod on the failed node try to add/delete the out-of-service taint when an SNR object is created.

      The failed node cannot be trusted, so in effect only the self-node-remediation-controller-manager can remediate it. As a result, when the manager itself is being evicted from the failed node, remediation is delayed.
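
      The before/after behavior can be sketched as a small Python model. This is only an illustration and not the SNR implementation: the Pod type, the "manager"/"daemonset" labels, the node names and both functions are hypothetical. The "before" branch models only what is stated above (daemonset pods on every node); whether the manager also acted before the patch is not stated here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pod:
    kind: str  # "manager" or "daemonset"; labels invented for this sketch
    node: str  # hypothetical node name


def taint_actors(pods: List[Pod], failed_node: str, after_patch: bool) -> List[Pod]:
    """Return the pods that try to add/delete the out-of-service taint."""
    if not after_patch:
        # Before the patch: the daemonset pod on every node tries.
        return [p for p in pods if p.kind == "daemonset"]
    # After the patch: only the manager pod and the daemonset pod
    # on the failed node try.
    return [
        p for p in pods
        if p.kind == "manager" or (p.kind == "daemonset" and p.node == failed_node)
    ]


def trusted_actors(pods: List[Pod], failed_node: str, after_patch: bool) -> List[Pod]:
    """Keep only the actors that do not run on the (untrusted) failed node."""
    return [p for p in taint_actors(pods, failed_node, after_patch)
            if p.node != failed_node]


# Hypothetical cluster: the manager happens to run on the node that fails.
pods = [
    Pod("manager", "node-a"),
    Pod("daemonset", "node-a"),
    Pod("daemonset", "node-b"),
    Pod("daemonset", "node-c"),
]

print(len(trusted_actors(pods, "node-a", after_patch=False)))  # 2
print(len(trusted_actors(pods, "node-a", after_patch=True)))   # 0
```

      In this model, once the patch is applied and the manager sits on the failed node, no pod on a healthy node is left to handle the taint, which corresponds to the delay described above.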

      To avoid this delay, the manager should run with 2 replicas.
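
      Why a second replica helps can be sketched in a few lines of Python. This is a simplified model, not SNR code: remediation_delayed and the node names are hypothetical, and the sketch assumes the two replicas end up on different nodes, which this issue does not state.

```python
from typing import Sequence


def remediation_delayed(manager_nodes: Sequence[str], failed_node: str) -> bool:
    """True when no manager replica runs on a healthy node.

    manager_nodes lists the node of each manager replica (hypothetical input).
    """
    return all(node == failed_node for node in manager_nodes)


# One replica, located on the failed node: nothing trusted can remediate yet.
print(remediation_delayed(["node-a"], "node-a"))            # True
# Two replicas on different nodes (assumption): one survives the failure.
print(remediation_delayed(["node-a", "node-b"], "node-a"))  # False
```

      If both replicas were scheduled on the same node, the model reports a delay again, so the benefit depends on how the replicas are spread across nodes.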

              kkii@redhat.com Keiichi Kii
