OpenShift Container Platform (OCP) Strategy
OCPSTRAT-2365

Improve Node Reliability and Control Plane Load by Managing Terminated Pods Proactively (upstream work)



Feature Overview (aka Goal Summary)

      OpenShift clusters running high-churn workloads (e.g., OpenShift Pipelines, GitOps, CI/CD jobs) often accumulate thousands of terminated pods and exited containers on worker nodes. This leads to:

      • kubelet failures due to gRPC message size limits (e.g., ListPodSandbox errors),
      • NotReady node states caused by "PLEG is not healthy" errors,
      • Excessive memory usage by CRI-O,
      • Slow API server performance and increased etcd load,
      • Operational risk and service degradation in production clusters.
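
      As a rough diagnostic (illustrative only, not part of this proposal), an administrator can gauge the scale of accumulation by counting terminated pods cluster-wide with standard field selectors:

```shell
# Count terminated pods across all namespaces (diagnostic sketch).
# Succeeded = completed pods (e.g., finished pipeline tasks/jobs);
# Failed    = pods that exited with an error.
oc get pods -A --field-selector=status.phase==Succeeded --no-headers | wc -l
oc get pods -A --field-selector=status.phase==Failed --no-headers | wc -l
```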

      The current threshold for terminated pod garbage collection (terminated-pod-gc-threshold) is set to a high default (12,500), and the only way to modify it is through unsupported overrides. This proposal introduces a supported, configurable mechanism to automatically manage terminated pods per node and at the cluster level.
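
      For context, the unsupported override in question is a passthrough on the kube-controller-manager operator. A sketch of what customers do today (this taints the cluster as unsupported and is shown only to illustrate the gap this proposal closes):

```shell
# UNSUPPORTED workaround: lower terminated-pod-gc-threshold via the
# kube-controller-manager operator's unsupportedConfigOverrides field.
# Not a supported API; the value "1000" here is illustrative.
oc patch kubecontrollermanager cluster --type=merge -p '
{"spec":{"unsupportedConfigOverrides":{"extendedArguments":{
  "terminated-pod-gc-threshold":["1000"]}}}}'
```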


      ✅ Key Use Cases

      1. Prevent Node Failures Due to Excessive Exited Pods
        • Nodes become NotReady when gRPC message sizes exceed limits due to too many exited containers (e.g., pipelines with long annotations or massive pod counts).
      2. Avoid Kubelet PLEG Failures and Container Runtime Crashes
        • Accumulated terminated pods lead to frequent "PLEG is not healthy" and "container runtime is down" errors, degrading node health and triggering service disruption.
      3. Control etcd and API Server Load from Excess Pod Metadata
        • High numbers of terminated pods increase object counts in etcd and slow down the control plane, complicating troubleshooting and inflating cluster load.
      4. Replace Fragile Workarounds like CronJobs for Pod Cleanup
        • Customers currently rely on manual cleanup scripts or jobs to delete completed pods, which are error-prone and do not scale.
      5. Protect Multi-Tenant and High-Scale Environments
        • In large clusters or shared environments, a single workload (e.g., a misconfigured Tekton pipeline) can produce thousands of terminated pods, risking node and cluster health.
      6. Support a Configurable Policy for the Terminated Pod Lifecycle
        • Customers request a tunable threshold via supported APIs (not via unsupportedConfigOverrides), e.g., with values based on cluster topology:
          0.1 * node_count * maxPodsPerNode (min 1,000; max 19,000)
          with changes applied after 24 hours or if the delta exceeds 25%.
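
      The proposed sizing formula from use case 6 can be sketched as a small function (the function name and integer rounding are illustrative; the clamp bounds are from the proposal):

```shell
# threshold = 0.1 * node_count * maxPodsPerNode, clamped to [1000, 19000].
# Uses integer arithmetic (divide by 10) for the 0.1 factor.
terminated_pod_gc_threshold() {
  local node_count=$1 max_pods_per_node=$2
  local t=$(( node_count * max_pods_per_node / 10 ))
  (( t < 1000 ))  && t=1000    # floor: small clusters still keep some history
  (( t > 19000 )) && t=19000   # ceiling: cap etcd object growth on large clusters
  echo "$t"
}

terminated_pod_gc_threshold 3 250    # small cluster -> clamped up to 1000
terminated_pod_gc_threshold 100 250  # 100 nodes x 250 pods -> 2500
```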

              Gaurav Singh (gausingh@redhat.com)
              Ayato Tokubi
              Aruna Naik