Uploaded image for project: 'OpenShift Container Platform (OCP) Strategy'
  1. OpenShift Container Platform (OCP) Strategy
  2. OCPSTRAT-1896

Dynamic Scaling for Hosted Control Planes

XMLWordPrintable

    • BU Product Work
    • False
    • Hide

      None

      Show
      None
    • False
    • 100% To Do, 0% In Progress, 0% Done
    • 7
    • 0

      Problem

      The current Hypershift Operator scaling logic is based solely on the cluster node count. This approach does not accurately capture the real workload or resource pressure exerted on the control plane components, particularly the API server. It fails to consider metrics such as request patterns, payload sizes, and resource utilization, leading to suboptimal scaling decisions. This limitation often results in cluster service providers' manual overrides, which can be costly. 

      Feature Overview (aka. Goal Summary)

      The goal here is to expand the Hypershift Operator’s control plane scaling logic to incorporate more workload and resource metrics, such as API server request patterns, CPU utilization, and memory pressure on nodes, to adjust the control plane footprint dynamically. This is to improve cluster performance and avoid congestion when under load. Metrics can include:
       

      • Shape of API Server Requests: Patterns and distribution of incoming requests (e.g., burstiness, sustained high QPS, read vs. write mix).
      • Size of API Server Requests: Average and peak payload sizes that can affect processing time and resource usage.
      • CPU Utilization of Request-Serving Nodes: Monitoring and factoring in CPU saturation or headroom on the nodes that serve API requests to inform scaling decisions.
      • Memory Pressure on Request-Serving Nodes: Evaluating memory constraints on these nodes to prevent API server degradation due to memory exhaustion.

      Requirements (aka. Acceptance Criteria)

      • Functional Requirements:
        • Integration of API server request patterns (e.g., burstiness, QPS, read/write ratio) into scaling logic.
        • Consideration of API server payload sizes for dynamic scaling decisions.
        • Monitoring CPU utilization and memory pressure on nodes serving API requests.
        • Automated scaling triggers based on the aggregated metrics.
        • Consider/explore VPA/HPA for scaling

       

      • Nonfunctional Requirements:
        • Reliability: Scaling logic must operate seamlessly without impacting control plane availability.
        • Scalability: Support fleets of clusters with varying workloads.
        • Performance: Metrics gathering and processing should not introduce significant overhead.
        • Maintainability: Configuration and tuning of the scaling logic should be straightforward for operators.
        • Security: Metrics collection and usage must adhere to existing security protocols.

      Deployment Considerations

      Deployment Scenario Applicable Needs
      Self-managed, managed, or both Both
      Classic (standalone cluster) N/A
      Hosted control planes Yes
      Multi node, Compact, or SNO No
      Connected / Restricted Network Both
      Architectures (x86_x64, ARM, etc.) x86_x64, ARM, IBM Power, IBM Z
      Operator compatibility N/A
      Backport needed Yes (list applicable versions)
      UI need N/A
      Other N/A

      Use Cases (Optional)

      • Peak Traffic Handling: A control plane experiences high API request rates, triggering dynamic scaling to maintain responsiveness.
      • Cost Optimization: Scaling decisions prioritize cost-efficient instance types without compromising performance.
      • Proactive Degradation Prevention: Scaling adjusts preemptively based on memory or CPU pressure trends.

      Questions to Answer (Optional)

      • What thresholds should trigger scaling decisions based on API request patterns?
      • How will this feature handle noisy metrics or temporary spikes?
      • Can existing metrics pipelines (e.g., Prometheus) be reused without performance trade-offs?

      Out of Scope

      • Modifications to non-Hypershift control plane components.
      • Integration with third-party scaling solutions not supported by Hypershift.

      Documentation Considerations

      • Provide updated Hypershift Operator documentation detailing new scaling logic and configurable parameters.
      • Include examples and best practices for tuning metrics-based scaling.
      • Document troubleshooting guidelines for common scaling issues.

       

              Unassigned Unassigned
              azaalouk Adel Zaalouk
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

                Created:
                Updated: