  1. OpenShift Container Platform (OCP) Strategy
  2. OCPSTRAT-943

[Tech-Preview] Native Karpenter with ROSA+HCP



      Feature Overview (aka. Goal Summary)  

      As a cluster administrator, I want to use Karpenter instead of the Cluster Autoscaler (CAS) to scale nodes on an OpenShift cluster running in AWS. I want to automatically manage heterogeneous compute resources in my OpenShift cluster without the additional manual task of managing node pools. Additional features I want are:

      • Reduced cloud costs through instance selection and scaling up and down
      • Support for GPUs, spot instances, and mixed or other compute types
      • Automatic node lifecycle management and upgrades

      This feature covers the work done to integrate upstream Karpenter 1.x with ROSA HCP. This eliminates the need for manual node pool management while ensuring cost-effective compute selection for workloads. Red Hat manages the node lifecycle and upgrades.

      The feature will be rolled out first on ROSA (AWS), which has the more mature Karpenter ecosystem, followed by an ARO (Azure) implementation (see OCPSTRAT-1498).

      Goals (aka. expected user outcomes)

      1. Run Karpenter in management cluster and disable CAS
      2. Automate node provisioning in workload cluster
      3. Automate lifecycle management in workload cluster
      4. Reduce cost in heterogeneous compute workloads

      https://docs.google.com/document/d/1ID_IhXPpYY4K3G_wa1MYJxOb3yz5FYoOj3ONSkEDsZs/edit?tab=t.0#heading=h.yvv1wy2g0utk

      Requirements (aka. Acceptance Criteria):

      As a cluster-admin or SRE, I should be able to configure Karpenter with OCP on AWS. Both the CLI and the UI should let users configure Karpenter and disable CAS.

      1. Run Karpenter in management cluster and disable CAS
      2. OCM API 
        • Enable/disable Cluster Autoscaler
        • Enable/disable AutoNode feature
        • New ARN role configuration for Karpenter
        • Optional: New managed policy or integration with existing nodepool role permissions
      3. Expose NodeClass/NodePool resources to users
      4. Secure node provisioning and management, including a machine approval system for Karpenter instances
      5. HCP Karpenter cleanup/deletion support
      6. ROSA CAPI fields to enable/disable/configure Karpenter
      7. Write end-to-end tests for Karpenter running on ROSA HCP
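
To make requirement 3 concrete, the following is a minimal sketch of what a user-facing NodePool resource could look like, built as a plain Python dict and printed as JSON. It assumes the upstream Karpenter v1 API shape (`karpenter.sh/v1` NodePool referencing an `EC2NodeClass`); the fields ROSA HCP actually exposes may be a restricted or wrapped subset. The helper `build_node_pool`, the resource name `default`, and the CPU limit are hypothetical values chosen for illustration.

```python
import json


def build_node_pool(name, node_class_name, cpu_limit):
    """Return an upstream-style Karpenter v1 NodePool manifest as a dict.

    Assumption: field names follow upstream Karpenter v1; ROSA HCP may
    expose a different or narrower schema.
    """
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # Points at the AWS-specific node class resource.
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class_name,
                    },
                    "requirements": [
                        # Both architectures listed in the deployment table.
                        {
                            "key": "kubernetes.io/arch",
                            "operator": "In",
                            "values": ["amd64", "arm64"],
                        },
                        # Allow spot and on-demand capacity.
                        {
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": ["spot", "on-demand"],
                        },
                    ],
                }
            },
            # Upper bound on total CPU this NodePool may provision.
            "limits": {"cpu": str(cpu_limit)},
            "disruption": {
                "consolidationPolicy": "WhenEmptyOrUnderutilized",
                "consolidateAfter": "1m",
            },
        },
    }


# Hypothetical names and limit, for illustration only.
manifest = build_node_pool("default", "default", 100)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be applied with standard Kubernetes tooling once the resource types are available in the workload cluster; which fields are user-editable on ROSA HCP is left open by this Feature.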

      Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it has been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out of scope for a given release, also provide the OCPSTRAT that tracks the future supported configuration.

      Deployment considerations | List applicable specific needs (N/A = not applicable)
      Self-managed, managed, or both | managed ROSA HCP
      Classic (standalone cluster) |
      Hosted control planes | yes
      Multi node, Compact (three node), or Single node (SNO), or all | MNO
      Connected / Restricted Network | Connected
      Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | x86_64, ARM (aarch64)
      Operator compatibility |
      Backport needed (list applicable versions) | No
      UI need (e.g. OpenShift Console, dynamic plugin, OCM) | yes - console
      Other (please specify) | rosa-cli

      Use Cases (Optional):

      Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

      <your text here>

      Questions to Answer (Optional):

      Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

      <your text here>

       

      Out of Scope

      High-level list of items that are out of scope.  Initial completion during Refinement status.

      • Supporting this feature in Standalone OCP/self-hosted HCP/ROSA classic
      • Creating a multi-provider cost/pricing operator compatible with CAPI. That work is expected to take longer and is not part of this Feature.

      Background

      Provide any additional context needed to frame the feature.  Initial completion during Refinement status.

      • Karpenter (karpenter.sh) is an open-source node provisioning project built for Kubernetes. It is designed to simplify Kubernetes infrastructure by automatically launching and terminating nodes based on the needs of your workloads. Karpenter can help you reduce costs, improve performance, and simplify operations.
      • Karpenter works by observing the unschedulable pods in your cluster and launching new nodes to accommodate them. Karpenter can also terminate nodes that are no longer needed, which saves infrastructure costs.
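      • The Karpenter architecture has two components: karpenter-core and a cloud-specific karpenter-provider.
        The core does the resource calculation that reduces cost by replacing existing nodes with newly provisioned ones; the provider (here, the AWS provider) holds the AWS-specific code for instance types and for launching and terminating instances.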
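
The provisioning idea described above can be sketched as follows. This is a deliberately simplified illustration, not Karpenter's actual algorithm: it sums the resource requests of pending pods and picks the cheapest single instance type that fits all of them, whereas real Karpenter bin-packs across multiple nodes and accounts for scheduling constraints, daemonsets, and node overhead. The instance names, sizes, and prices are hypothetical and do not correspond to real AWS offerings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpu: float           # vCPUs
    memory_gib: float
    hourly_price: float  # illustrative number, not a real price


@dataclass(frozen=True)
class Pod:
    name: str
    cpu: float           # requested vCPUs
    memory_gib: float    # requested memory


def cheapest_fit(pending, catalog):
    """Return the cheapest instance type that holds all pending pods, or None."""
    need_cpu = sum(p.cpu for p in pending)
    need_mem = sum(p.memory_gib for p in pending)
    candidates = [
        t for t in catalog
        if t.cpu >= need_cpu and t.memory_gib >= need_mem
    ]
    return min(candidates, key=lambda t: t.hourly_price, default=None)


# Hypothetical catalog for illustration only.
CATALOG = [
    InstanceType("small", 2, 8, 0.10),
    InstanceType("medium", 4, 16, 0.20),
    InstanceType("large", 8, 32, 0.40),
]

pending = [Pod("api", 1.5, 4), Pod("worker", 1.0, 2)]
choice = cheapest_fit(pending, CATALOG)
print(choice.name if choice else "no single instance type fits")
```

With the sample data, the two pods request 2.5 vCPUs in total, so the 2-vCPU type is too small and the next cheapest type is selected. The same comparison run against already-provisioned nodes is the intuition behind consolidation: if a cheaper type would fit the current pods, the node is a candidate for replacement.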

      Customer Considerations

      Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

      <your text here>

      Documentation Considerations

      Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

      • Migration guides from using CAS to Karpenter
      • Performance testing to compare CAS vs Karpenter on ROSA HCP
      • API documentation for NodePool and EC2NodeClass configuration

      Interoperability Considerations

      Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

      <your text here>

              rh-ee-smodeel Subin M
              Aaren de Jong, Alberto Garcia Lamela, Andrew Cathrow, Derek Carr, James Harrington, Joel Speed, Julio Faerman, Michael McCune, Subin M
              Russell Teague
              Jeana Routh
              Michael McCune
              Subin M