OpenShift Request For Enhancement / RFE-4894

Add per-node terminated pod eviction threshold to the pod garbage collector


    • Type: Feature Request
    • Resolution: Unresolved
    • Priority: Critical
    • Component: Node

      1. Proposed title of this feature request

      Add per-node terminated pod eviction threshold to the pod garbage collector

      2. What is the nature and description of the request?

      We are requesting a kube-controller-manager flag similar to terminated-pod-gc-threshold, but applied per node, i.e. one that deletes terminated pods on a given node if their count grows beyond the threshold. This flag should work in addition to the already existing terminated-pod-gc-threshold.

      The recommended default threshold should be on the order of max-pods, and no more than double it (the reasoning is explained below).
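
      To make the request concrete, below is a minimal sketch of what such a per-node pass could look like, modelled loosely on the existing terminated-pod GC in kube-controller-manager. The function name gcTerminatedPerNode and the perNodeThreshold parameter are assumptions for illustration, not a proposed implementation.

      // Sketch only: a per-node terminated-pod GC pass that would run in addition
      // to the existing cluster-wide terminated-pod GC. Names are illustrative.
      package podgcsketch

      import (
          "context"
          "fmt"
          "sort"

          corev1 "k8s.io/api/core/v1"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
      )

      // isTerminated mirrors the pod GC notion of a finished pod.
      func isTerminated(pod *corev1.Pod) bool {
          return pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed
      }

      // gcTerminatedPerNode deletes the oldest terminated pods on each node whose
      // terminated-pod count exceeds perNodeThreshold, independently of the
      // cluster-wide threshold.
      func gcTerminatedPerNode(ctx context.Context, client kubernetes.Interface, pods []*corev1.Pod, perNodeThreshold int) error {
          byNode := map[string][]*corev1.Pod{}
          for _, pod := range pods {
              if isTerminated(pod) && pod.Spec.NodeName != "" {
                  byNode[pod.Spec.NodeName] = append(byNode[pod.Spec.NodeName], pod)
              }
          }
          for node, terminated := range byNode {
              excess := len(terminated) - perNodeThreshold
              if excess <= 0 {
                  continue
              }
              // Delete the oldest terminated pods first, as the cluster-wide GC does.
              sort.Slice(terminated, func(i, j int) bool {
                  return terminated[i].CreationTimestamp.Before(&terminated[j].CreationTimestamp)
              })
              for _, pod := range terminated[:excess] {
                  if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
                      return fmt.Errorf("deleting pod %s/%s on node %s: %w", pod.Namespace, pod.Name, node, err)
                  }
              }
          }
          return nil
      }

      Wired into the pod GC controller's sync loop, such a pass would complement, not replace, the existing terminated-pod-gc-threshold behaviour.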

      3. Why does the customer need this? (List the business requirements here)

      If the number of exited containers on a node goes above a certain threshold, the kubelet ends up unresponsive because the gRPC response returned over the CRI socket becomes too big.

      The global, cluster-wide terminated-pod-gc-threshold limit is not enough: the cluster can have fewer terminated pods than the cluster-wide threshold and yet have enough of them on a single node to overwhelm that node. This is why we need a per-node threshold.

      Increasing the maximum gRPC message size is not an option either: this was done in the past, but we cannot keep increasing it forever.

      In addition:

      • We cannot rely on --maximum-dead-containers and/or --maximum-dead-containers-per-container in the long term, because those flags are deprecated. There is no firm removal date yet (it depends on upstream goals that have not been prioritized), but they will eventually be cleaned up.
      • We cannot rely on the current container eviction mechanisms, because those trigger only on storage-consumption thresholds, while the gRPC overload is caused purely by the number of containers: create enough pods with small enough containers and storage-based eviction is never triggered (see the rough estimate after this list).
      • We cannot rely on max-pods, because it does not count terminated pods.
      • We cannot rely on users' good will and/or carefulness, because we need to protect against intentional DoS.
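
      As a rough illustration of the container-count point above, the back-of-envelope sketch below estimates how many container records fit in a single CRI list response before a kubelet-side gRPC message cap is reached. Both the 16 MiB cap and the average per-record size are assumptions for illustration, not measured values.

      // Back-of-envelope only: the CRI list response grows with the number of
      // container records, not with how much storage those containers consume.
      package main

      import "fmt"

      func main() {
          const maxGRPCMessageBytes = 16 * 1024 * 1024 // assumed kubelet-side gRPC message cap
          const bytesPerContainerRecord = 4 * 1024     // assumed average size of one container record (IDs, labels, annotations, image refs)

          // Roughly how many container records fit before the response is rejected.
          fmt.Println(maxGRPCMessageBytes / bytesPerContainerRecord) // ~4096 records
      }

      Under these assumptions, a few thousand leftover containers on a single node are enough to break the kubelet's CRI calls, regardless of how little disk they consume.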

      Last but not least: the reason to suggest a default threshold on the same order as max-pods is that the goal of this threshold is to protect the individual kubelets, not to protect the kube-apiserver from an excessive number of pod API objects.
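
      A hedged sizing example for the suggested default, assuming a max-pods of 250 and the same illustrative per-record figures as above (both assumptions, not recommendations):

      // Illustrative sizing only: shows that a per-node threshold between max-pods
      // and 2 x max-pods keeps the worst-case CRI list response small.
      package main

      import "fmt"

      func main() {
          const maxPods = 250                  // assumed node max-pods
          const containersPerPod = 2           // assumed average, including exited containers
          const bytesPerContainerRecord = 4096 // same illustrative figure as above

          for _, threshold := range []int{maxPods, 2 * maxPods} {
              records := threshold * containersPerPod
              fmt.Printf("per-node threshold %d -> ~%d container records, ~%.1f MiB CRI list response\n",
                  threshold, records, float64(records*bytesPerContainerRecord)/(1024*1024))
          }
      }

      Both values stay far below the assumed 16 MiB cap, which is the point of keeping the default on the same order as max-pods.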

      4. List any affected packages or components.

      • kube-controller-manager
      • cluster-kube-apiserver-operator

            gausingh@redhat.com Gaurav Singh
            rhn-support-palonsor Pablo Alonso Rodriguez