OpenShift Request For Enhancement / RFE-2971

[RFE] Handle cases where the node is removed in SNO using delete operation

    • Type: Feature Request
    • Resolution: Done
    • Priority: Blocker

      User Story

      As a user, I want the MCD (Machine Config Daemon) to restart the kubelet so that the node rejoins the cluster when a user deletes the node, intentionally or unintentionally, in SNO (Single Node OpenShift). This is in addition to preventing the user from deleting the node in the first place: https://github.com/kubernetes/enhancements/issues/2775.

      Background

      Deleting the node in a Single Node OpenShift cluster with the client ($ oc delete node <node>) causes some cluster operators and applications to fail, because no node remains for their pods to be scheduled on. Whether the user runs the operation intentionally or unintentionally, it impacts the uptime of applications as well as a couple of system components. One way to recover is to restart the kubelet on the node so that it registers with the cluster again, which in turn allows pods to be scheduled. While https://github.com/kubernetes/enhancements/issues/2775 will help prevent this operation in the future, we need to alert the user, and the MCD needs to recognize the signal and handle the kubelet restart to avoid downtime where possible, as discussed in https://coreos.slack.com/archives/C018KQE33MF/p1626200735229900. Logs, including the cluster operator status: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/node-deletion/.
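
      The recovery logic described above could be sketched roughly as follows. This is only an illustration of the decision "node missing from the API, so restart the kubelet", not the MCD's actual implementation (the MCD is written in Go). All function names, the cooldown parameter, and its 300-second default are hypothetical; the node listing and the restart action are passed in as callables, for example wrappers around `oc get nodes -o name` and `systemctl restart kubelet`.

```python
import time
from typing import Callable, Iterable, Optional


def node_is_registered(node_name: str, registered: Iterable[str]) -> bool:
    """Return True if node_name is among the nodes known to the API server.

    Accepts bare names or the "node/<name>" form printed by
    `oc get nodes -o name`.
    """
    names = {entry.strip().split("/", 1)[-1] for entry in registered}
    return node_name in names


def should_restart_kubelet(
    node_name: str,
    registered: Iterable[str],
    last_restart: Optional[float],
    now: float,
    cooldown_seconds: float = 300.0,  # hypothetical default, not from the RFE
) -> bool:
    """Decide whether to restart the kubelet so the node re-registers."""
    if node_is_registered(node_name, registered):
        return False  # node object still exists, nothing to recover
    if last_restart is not None and now - last_restart < cooldown_seconds:
        return False  # restarted recently, give the kubelet time to register
    return True


def reconcile_once(
    node_name: str,
    list_nodes: Callable[[], Iterable[str]],
    restart_kubelet: Callable[[], None],
    last_restart: Optional[float] = None,
    now: Optional[float] = None,
    cooldown_seconds: float = 300.0,
) -> Optional[float]:
    """Run one check; return the timestamp of the most recent restart."""
    current = time.monotonic() if now is None else now
    if should_restart_kubelet(
        node_name, list_nodes(), last_restart, current, cooldown_seconds
    ):
        restart_kubelet()  # assumed to wrap e.g. `systemctl restart kubelet`
        return current
    return last_restart
```

      The cooldown is a design assumption: without it, a check that runs more often than the kubelet takes to register would restart the kubelet in a loop. Alerting the user, which this RFE also asks for, is left out of the sketch.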

      Stakeholders

              gausingh@redhat.com Gaurav Singh
              nelluri Naga Ravi Chaitanya Elluri