  1. OpenShift Container Platform (OCP) Strategy
  2. OCPSTRAT-1525

Enable autoscaler from/to zero on Hypershift


    • Product / Portfolio Work
    • None
    • 0% To Do, 100% In Progress, 0% Done
      • Color Status: Green
      • Status summary:
        • Jesse Jaggars (from the HCM team) has implemented most of the main functionality for the API, controller, and RBAC. PR #6975 – CNTRLPLANE-1622. Liangquan is reviewing the PR.
        • Because Jesse is using a workaround to annotate the MachineDeployment, we should raise a PR against upstream CAPA as a permanent solution. Draft PR #5711.
      • Risks:
        • N/A

      Feature Overview

      Enable autoscaling from/to zero for NodePools of Hosted Clusters, allowing customers to manage resources efficiently and reduce costs when compute is not needed. This can be achieved by having CAPI support autoscaling from/to zero, which should be incorporated into the HyperShift operator logic.
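
      For illustration only, here is a minimal sketch of the kind of metadata a MachineDeployment needs so that the cluster-autoscaler can reason about a node group that currently has zero nodes. The annotation keys are, to the best of my knowledge, the ones documented by the cluster-autoscaler Cluster API provider and should be verified against the version in use. The function name and the example CPU and memory values are hypothetical, and this is not necessarily how the HyperShift operator or the PRs referenced in this issue set them.

```python
# Annotation keys believed to be read by the cluster-autoscaler
# Cluster API provider (verify against the autoscaler version in use).
MIN_SIZE_KEY = "cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size"
MAX_SIZE_KEY = "cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size"
CAPACITY_PREFIX = "capacity.cluster-autoscaler.kubernetes.io/"


def scale_from_zero_annotations(min_size, max_size, cpu, memory):
    """Build the annotation map for a node group that may sit at zero nodes.

    Hypothetical helper: with no running node to inspect, the autoscaler
    needs the capacity of a would-be node stated up front.
    """
    return {
        MIN_SIZE_KEY: str(min_size),
        MAX_SIZE_KEY: str(max_size),
        CAPACITY_PREFIX + "cpu": cpu,
        CAPACITY_PREFIX + "memory": memory,
    }


# Example values are assumptions, not taken from this issue.
annotations = scale_from_zero_annotations(0, 3, "4", "16G")
for key, value in sorted(annotations.items()):
    print(f"{key}: {value}")
```

      The capacity hints are what make scaling up from zero possible: without a live node, the autoscaler has no other source for the size of the node it would create.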

      Goals

      • Allow HCP node pools to scale down to zero nodes when not in use
      • Enable autoscaling to automatically provision nodes when workloads require them
      • Provide a seamless experience for users managing compute resources in HCP clusters

      Primary user type: Cluster Service Consumers

      Expands on existing features: Node pool management and autoscaling capabilities

      Requirements

      1. Implement the ability to set min-replicas=0 and autoscale=y simultaneously for HCP node pools
      2. Ensure proper draining of nodes when scaling down to zero
      3. Handle degraded operators gracefully when no data plane compute is available
      4. Implement changes in the cluster-autoscaler machinery to support scaling from/to zero
      5. (optional/desired) Optimize performance to minimize the time required to scale up from zero
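
      Requirement 1 can be sketched as a validation rule. The following is a hedged illustration, not the HyperShift API: the `NodePoolScaling` type, its field names, and the rule that fixed replicas and autoscaling are mutually exclusive are assumptions made for this example. The change this feature implies is that the lower bound for the autoscaling minimum becomes 0.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodePoolScaling:
    """Hypothetical stand-in for a node pool's scaling settings."""
    replicas: Optional[int] = None
    autoscale_min: Optional[int] = None
    autoscale_max: Optional[int] = None


def validate(spec: NodePoolScaling) -> List[str]:
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    autoscaling = spec.autoscale_min is not None or spec.autoscale_max is not None
    if autoscaling and spec.replicas is not None:
        # Assumption: a pool is either fixed-size or autoscaled, not both.
        errors.append("replicas and autoscaling are mutually exclusive")
    if autoscaling:
        if spec.autoscale_min is None or spec.autoscale_max is None:
            errors.append("autoscaling requires both min and max")
        else:
            if spec.autoscale_min < 0:
                errors.append("min must be >= 0")  # zero is allowed
            if spec.autoscale_max < 1:
                errors.append("max must be >= 1")
            if spec.autoscale_min > spec.autoscale_max:
                errors.append("min must be <= max")
    return errors


print(validate(NodePoolScaling(autoscale_min=0, autoscale_max=3)))
print(validate(NodePoolScaling(autoscale_min=4, autoscale_max=3)))
```

      Requiring a maximum of at least 1 is a design choice in this sketch: a pool whose maximum is 0 could never run workloads, so it is treated as a configuration error rather than a valid scale-to-zero setup.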

      Deployment considerations

      • Self-managed, managed, or both: works for both
      • Classic (standalone cluster): N/A
      • Hosted control planes: Applicable
      • Multi node, Compact (three node), or Single node (SNO), or all: N/A
      • Connected / Restricted Network: Both
      • CPU Architectures: all
      • Operator compatibility: Ensure compatibility with relevant operators
      • Backport needed: To be determined based on priority and release schedule
      • UI need: should be tracked in a separate OCM Jira

      Use Cases

      1. Cost optimization: Customers can scale down to zero nodes during off-hours or low-demand periods
      2. On-demand scaling: Automatically provision nodes when workloads require them
      3. Development and testing: Easily spin up and down compute resources for development and testing environments

      Questions to Answer

      1. What changes are required in the cluster-autoscaler machinery to support scaling from/to zero?
      2. How will we handle the transition from zero nodes to active autoscaling?
      3. What impact will this feature have on cluster startup time when scaling up from zero?
      4. Are there any potential security implications of allowing clusters to scale to zero nodes?
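
      To make questions 1 and 2 more concrete, here is a simplified sketch of the decision an autoscaler faces when a node pool is at zero: with no real node to inspect, it has to work from a declared node template and estimate how many nodes the pending pods need. This uses a first-fit packing over CPU and memory only. The real cluster-autoscaler simulates the scheduler and considers much more (taints, labels, affinity, daemon sets), so treat this as an assumption-laden illustration. All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pod:
    """Resource requests of a pending pod (hypothetical, simplified)."""
    cpu_millis: int
    mem_mib: int


def nodes_needed(pending: List[Pod], node_cpu_millis: int,
                 node_mem_mib: int, max_size: int) -> int:
    """Estimate nodes to create from zero using first-fit packing."""
    free = []  # remaining [cpu, mem] on each planned node
    for pod in pending:
        if pod.cpu_millis > node_cpu_millis or pod.mem_mib > node_mem_mib:
            continue  # can never fit this node type; scaling up would not help
        for slot in free:
            if slot[0] >= pod.cpu_millis and slot[1] >= pod.mem_mib:
                slot[0] -= pod.cpu_millis
                slot[1] -= pod.mem_mib
                break
        else:
            if len(free) >= max_size:
                continue  # pool is at its maximum; the pod stays pending
            free.append([node_cpu_millis - pod.cpu_millis,
                         node_mem_mib - pod.mem_mib])
    return len(free)


pods = [Pod(3000, 4096), Pod(2000, 4096), Pod(1000, 1024)]
print(nodes_needed(pods, node_cpu_millis=4000, node_mem_mib=16384, max_size=3))
```

      An empty pending list yields zero, which is the scale-to-zero steady state; the first pending pod that fits the template triggers the transition back to active nodes.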

      Out of Scope

      • Implementing this feature for non-HCP clusters
      • Full Hibernation functionality, including the control-plane (as mentioned in the discussion)

      Background

      This feature is being requested to provide more flexibility and cost-efficiency for HCP cluster management. It builds upon the existing autoscaling capabilities and addresses limitations in current node pool management when combined with autoscaling.

      Customer Considerations

      • Provide clear documentation on the implications of scaling to zero (e.g., degraded operators)
      • Ensure a smooth user experience when transitioning between zero and active nodes
      • Consider potential impact on SLAs and cluster responsiveness

      Documentation Considerations

      • Create new documentation explaining how to enable and use autoscaling from/to zero
      • Update existing node pool and autoscaling documentation to include this new functionality
      • Provide best practices and considerations for using this feature

      Interoperability Considerations

      • Ensure compatibility with ROSA HCP
      • Verify interoperability with other OpenShift components and operators
      • Consider the impact on monitoring and logging systems when scaling to/from zero

              rhn-support-yli2 Yu Li
              agarcial@redhat.com Alberto Garcia Lamela
              None
              Michael McCune
              Cesar Wong
              Liangquan Li
              Wen Wang
              Matthew Werner
              Senthamilarasu S