OpenShift Request For Enhancement
RFE-8033

[RFE] Exposing a configurable `node-monitor-grace-period` parameter to prevent workload pods running on worker nodes from terminating or unnecessarily restarting when the connection to the kube-API is restored


    Quality / Stability / Reliability

      Description of problem:

      I am raising this RFE to improve the behavior of OCP in a specific use case.
      
      The state of the worker nodes is controlled by the kube-controller-manager through `node-monitor-grace-period` (defaulting to 40s) and `pod-eviction-timeout` (defaulting to 5 minutes), and by the kubelet configuration through `node-status-update-frequency` and `node-status-report-frequency`.
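
      For reference, the kubelet-side intervals are already tunable in OCP through a KubeletConfig custom resource, per the remote-worker documentation referenced below. A minimal sketch (the resource name and the interval values are illustrative, not recommendations):

      ```yaml
      apiVersion: machineconfiguration.openshift.io/v1
      kind: KubeletConfig
      metadata:
        name: worker-heartbeat-tuning    # hypothetical name
      spec:
        machineConfigPoolSelector:
          matchLabels:
            # default label on the "worker" MachineConfigPool
            pools.operator.machineconfiguration.openshift.io/worker: ""
        kubeletConfig:
          nodeStatusUpdateFrequency: "10s"   # how often the kubelet posts node status
          nodeStatusReportFrequency: "1m"    # how often status is reported when nothing has changed
      ```

      No equivalent supported knob exists for the controller-manager side, which is the gap this RFE addresses.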
      
      
      According to Network separation with remote workers [1], if the kube-controller-manager loses contact with a node for a configured period, the node controller on the control plane updates the node health to Unhealthy and marks the node Ready condition as Unknown.
      In response, the scheduler stops scheduling pods to that node. The on-premise node controller adds a node.kubernetes.io/unreachable taint with a NoExecute effect to the node and schedules pods on the node for eviction after five minutes, by default.
      If a workload controller, such as a Deployment object or StatefulSet object, is directing traffic to pods on the unhealthy node and other nodes can reach the cluster, OpenShift Container Platform routes the traffic away from the pods on the node. Nodes that cannot reach the cluster do not get updated with the new traffic routing. As a result, the workloads on those nodes might continue to attempt to reach the unhealthy node. 
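
      One partial mitigation described in the same document is to give pods tolerations for the taints the node controller applies, which delays their eviction from an unreachable node. A sketch of the pod-spec fragment (the 86400s / 24h value is illustrative):

      ```yaml
      # Pod-spec fragment: tolerate the taints applied to an unreachable node
      tolerations:
        - key: "node.kubernetes.io/unreachable"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 86400   # keep the pod bound for 24h instead of the 5m default
        - key: "node.kubernetes.io/not-ready"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 86400
      ```

      This only defers eviction, though; it does not prevent the node from being marked Unknown once `node-monitor-grace-period` expires.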
      
      
      The `node-status-update-frequency` parameter works together with the `node-monitor-grace-period` parameter. The `node-monitor-grace-period` parameter specifies how long OpenShift Container Platform waits, when the controller manager does not receive the node heartbeat, before a node associated with a MachineConfig object is marked Unhealthy. Workloads on the node continue to run after this time (I believe within the grace period and not after). 
      If the remote worker node rejoins the cluster after `node-monitor-grace-period` expires, pods continue to run and new pods can be scheduled to that node. The `node-monitor-grace-period` interval defaults to 40s. 
      The `node-status-update-frequency` value must be lower than the `node-monitor-grace-period` value.
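
      Upstream, this is the `--node-monitor-grace-period` flag of kube-controller-manager. In OCP today, to my knowledge it can only be reached through an unsupported override, which is exactly why a supported, configurable parameter is being requested. A sketch of that unsupported route, assuming the `extendedArguments` override pattern applies to this flag:

      ```yaml
      # UNSUPPORTED sketch only: unsupportedConfigOverrides voids support and
      # may be ignored or reverted. Shown purely to illustrate the parameter
      # this RFE asks to expose properly. The 5m value is illustrative.
      apiVersion: operator.openshift.io/v1
      kind: KubeControllerManager
      metadata:
        name: cluster
      spec:
        unsupportedConfigOverrides:
          extendedArguments:
            node-monitor-grace-period:
              - "5m"
      ```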
      
      
      This behavior is suboptimal for Hosted Control Planes with bare metal, Hosted Control Planes with remote virtualization infrastructure, and remote edge scenarios.
      
      
      In the case of hosted control planes, the impact is cascaded. Assume the control plane pods of a hosted cluster, running in a namespace on one of the management cluster's worker nodes, are operating fine, but the management cluster kube-API is for some reason isolated from the worker nodes. The control plane pods of the tenant (hosted) cluster can still communicate with their remote bare-metal worker nodes through different interfaces, yet we are faced with undesired scenarios. The first, worst-case scenario is the kube-controller terminating these pods. In the second, less severe scenario, when the connection is restored and the hosted cluster pods were still running, the kube-controller will recycle all the pods. In turn, a cascading effect might start in which the bare-metal remote workers of the hosted control plane are rebooted as well, or even lost along with their applications.
      
      
      Hosted Control Planes with remote virtualization infrastructure fares no better: although the remote infrastructure could be fully operational, this is not leveraged, and the loss of connection will cause the hosted cluster's virtualized workers to be rebooted. (I am not sure of this one, but I assume they will be descheduled.)
      
      
      [1]: https://docs.redhat.com/en/documentation/openshift_container_platform/4.19/html/nodes/remote-worker-nodes-on-the-network-edge#nodes-edge-remote-workers-network_nodes-edge-remote-workers

      Version-Release number of selected component (if applicable):

       

      How reproducible:

        Inherent behavior in all OCP releases since OCP 4.0

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

          

      Expected results:

          

      Additional info:

          
