Type: Feature
Resolution: Unresolved
Priority: Major
Feature Overview
This feature improves massive-scale Bare Metal as a Service (BMaaS) deployments by significantly increasing the number of managed bare metal nodes supported per Bare Metal Operator (BMO) instance within an OpenShift Container Platform (OCP) cluster. It gives Cluster Administrators the scale needed to manage large data center footprints, aligning OCP's bare metal provisioning capabilities with other hyperscale management platforms.
Goals
The observable functionality gained is the ability to reliably manage a significantly larger number of bare metal hosts using a single BMO instance.
- Primary User/Persona: Cluster Administrator (or Platform Engineer managing large-scale infrastructure).
- Key Goal: To introduce the performance and architectural changes in the BMO needed to support a validated scale target significantly beyond the current limits of 500–1,500 nodes (depending on the provisioning method: virtual media/Redfish or network booting).
- Target Alignment: To enable the management of up to 3,500 bare metal nodes per BMO, aligning with the scaling targets of management platforms such as Advanced Cluster Management (ACM).
- Expanded Functionality: This feature expands the existing Day 1 (initial provisioning) and Day 2 (firmware upgrade, configuration, and observability) management capabilities to operate reliably and performantly at the new scale limits; a Day 1 sketch follows this list.
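To make the Day 1 goal concrete, here is a minimal sketch of driving parallel host registration through the metal3 BareMetalHost API that the BMO reconciles. The namespace, host names, BMC addresses, and batch size are illustrative assumptions, not prescribed values.

```go
// Hypothetical sketch: registering a batch of BareMetalHost resources in
// parallel as a Day 1 provisioning exercise. All names and addresses are
// illustrative only.
package main

import (
	"context"
	"fmt"
	"sync"

	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	scheme := runtime.NewScheme()
	_ = metal3v1alpha1.AddToScheme(scheme)

	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{Scheme: scheme})
	if err != nil {
		panic(err)
	}

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // R1 below targets 100 hosts in parallel
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			host := &metal3v1alpha1.BareMetalHost{
				ObjectMeta: metav1.ObjectMeta{
					Name:      fmt.Sprintf("worker-%03d", i), // hypothetical name
					Namespace: "openshift-machine-api",       // assumed namespace
				},
				Spec: metal3v1alpha1.BareMetalHostSpec{
					Online: true,
					BMC: metal3v1alpha1.BMCDetails{
						// Hypothetical Redfish virtual media endpoint.
						Address:         fmt.Sprintf("redfish-virtualmedia://10.0.0.%d/redfish/v1/Systems/1", i+1),
						CredentialsName: fmt.Sprintf("worker-%03d-bmc-secret", i),
					},
				},
			}
			if err := c.Create(context.TODO(), host); err != nil {
				fmt.Printf("create %s failed: %v\n", host.Name, err)
			}
		}(i)
	}
	wg.Wait()
}
```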
Requirements
This feature requires the engineering team to confirm the feasibility of the 3,500-node target and define the specific metrics that demonstrate the new validated scale.
Functional Requirements (TBD by Engineering)
The engineering team must define and meet specific Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for key BMO operations at the new validated scale limit.
- R1: Day 1 Provisioning Performance: The BMO must successfully provision at least 100 new bare metal hosts in parallel and integrate them into the OpenShift Container Platform within a clearly defined time window.
- R2: Day 2 Management Performance (Firmware Upgrade): The BMO must successfully initiate and complete a full cycle of firmware upgrades and configuration changes across 200 managed hosts within a defined time frame.
- R3: Observability Performance: The BMO must maintain accurate, near real-time status and health reporting for all managed nodes, without degrading OpenShift API server performance or significantly delaying status updates (correlates with OCPSTRAT-2645; a minimal probe sketch follows this list).
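As a starting point for the R3 discussion, a minimal probe sketch, assuming a plain client-side List against the BareMetalHost API: it tallies provisioning states and uses the List duration as a rough status-reporting SLI. The namespace and the decision to bypass an informer cache are assumptions for illustration.

```go
// Hypothetical R3 probe: list every BareMetalHost, tally provisioning
// states, and report how long the listing took at the current scale.
package main

import (
	"context"
	"fmt"
	"time"

	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	scheme := runtime.NewScheme()
	_ = metal3v1alpha1.AddToScheme(scheme)

	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{Scheme: scheme})
	if err != nil {
		panic(err)
	}

	start := time.Now()
	var hosts metal3v1alpha1.BareMetalHostList
	if err := c.List(context.TODO(), &hosts, client.InNamespace("openshift-machine-api")); err != nil {
		panic(err)
	}

	// Tally hosts by provisioning state (e.g., available, provisioned, error).
	states := map[metal3v1alpha1.ProvisioningState]int{}
	for _, h := range hosts.Items {
		states[h.Status.Provisioning.State]++
	}

	fmt.Printf("listed %d hosts in %s\n", len(hosts.Items), time.Since(start))
	for state, count := range states {
		fmt.Printf("  %s: %d\n", state, count)
	}
}
```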
Non-Functional Requirements
- Scalability: The BMO architecture must be tuned to efficiently manage controller reconciliation loops, network I/O, and API load so that it reliably supports the new validated node limit (up to 3,500 nodes); a concurrency-tuning sketch follows this list.
- Performance: All critical BMO operations must scale linearly or near-linearly, avoiding performance cliffs as the managed node count approaches the new limit.
- Reliability/Error Rate: The failure rate for large-scale operations (e.g., provisioning or bulk configuration changes) must be minimal, with robust error handling and effective retry mechanisms to ensure a high success percentage (e.g., a 99.5% success rate for provisioning at the 3,500-node scale).
- Documentation: New limits must be clearly defined in the documentation, including all necessary assumptions (e.g., separate limits for virtual media/Redfish and network booting).
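The scalability requirement above points at reconciliation concurrency. The sketch below shows the generic controller-runtime knob (MaxConcurrentReconciles) that such tuning typically involves; the stand-in reconciler and the worker count of 20 are illustrative assumptions, not the BMO's actual wiring or a recommended value.

```go
// Minimal sketch of raising reconcile concurrency so host reconciliation
// is not serialized behind a single worker. Illustrative only.
package main

import (
	"context"

	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// hostReconciler is a stand-in for the BMO's BareMetalHost reconciler.
type hostReconciler struct{}

func (r *hostReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Real reconciliation (power management, inspection, provisioning)
	// would happen here.
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	_ = metal3v1alpha1.AddToScheme(mgr.GetScheme())

	// More concurrent workers spread API and BMC load across hosts
	// instead of queueing them behind one reconcile loop.
	err = ctrl.NewControllerManagedBy(mgr).
		For(&metal3v1alpha1.BareMetalHost{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 20}). // illustrative value
		Complete(&hostReconciler{})
	if err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```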
Use Case
As a Cluster Administrator, I want the Bare Metal Operator to reliably manage up to 3,500 BMaaS nodes in a single cluster so that I can utilize OpenShift Container Platform's management plane to operate my organization's largest bare metal data center footprint efficiently and cost-effectively.
Questions to Answer (Refinement/Architectural)
The following high-level architectural and planning questions must be answered by the engineering team to define the final scope for OpenShift 4.22:
- What specific architectural bottlenecks have been identified that prevent scaling beyond the current limits (500–1,500 nodes)?
- What is the maximum validated node limit achievable in the OpenShift 4.22 release for both virtual media/Redfish and network booting methods?
- What are the proposed SLIs/SLOs for R1, R2, and R3 (e.g., specific timeframes in hours for provisioning 100 nodes)?
- What is the estimated memory and CPU footprint of the BMO controller plane at the new maximum scale?
Out of Scope
The following items are explicitly not included in the scope of this feature:
- Adding new hardware vendor support or integrating new hardware-specific provisioning interfaces.
Links
- Bare Metal as a Service (BMaaS) GA Feature: https://issues.redhat.com/browse/OCPSTRAT-2405
- Observability at Scale Epic (Correlated): https://issues.redhat.com/browse/OCPSTRAT-2645
- Existing Documentation (BMaaS): https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html-single/installing_on_bare_metal/index#bare-metal-using-bare-metal-as-a-service (Documentation for 4.22 will need updating to reflect new validated limits).
- Super-scaling BMaaS: https://docs.google.com/document/d/1AkzRmm26Ds2Vg1OmWttHekDgbQSXH0xnWEzHoZt7XfE/edit?tab=t.0#heading=h.kniulhd3mgou
Issue Links
- clones: OCPSTRAT-2378 [TP] Enable IPE (Ironic Prometheus Exporter) on OCP (status: Refinement)
- is blocked by: OCPSTRAT-2405 [GA] BareMetal as a Service Support for OpenShift (status: New)