Type: Feature
Resolution: Unresolved
Priority: Major
Feature Overview
This feature improves massive-scale Bare Metal as a Service (BMaaS) deployments by significantly increasing the number of managed bare metal nodes supported per Bare Metal Operator (BMO) instance within an OpenShift Container Platform (OCP) cluster. It gives Cluster Administrators the scale needed to manage large data center footprints, aligning OCP's bare metal provisioning capabilities with other hyperscale management platforms.
Goals
The observable functionality gained is the ability to reliably manage a significantly larger number of bare metal hosts using a single BMO instance.
- Primary User/Persona: Cluster Administrator (or Platform Engineer managing large-scale infrastructure).
- Key Goal: To introduce the performance and architectural changes in the BMO needed to support a validated scale target significantly beyond the current limits of 500–1,500 nodes (depending on the provisioning method: virtual media/Redfish or network booting).
- Target Alignment: To enable the management of up to 3,500 bare metal nodes per BMO, aligning with the scaling targets of management platforms such as Advanced Cluster Management (ACM).
- Expanded Functionality: This feature expands the existing Day 1 (initial provisioning) and Day 2 (firmware upgrade, configuration, and observability) management capabilities to operate reliably and performantly at the new scale limits; a Day 1 sketch follows this list.
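To make the Day 1 goal concrete, here is a minimal sketch of driving parallel host registration through the metal3 BareMetalHost API that the BMO reconciles. The namespace, host names, BMC addresses, and batch size are illustrative assumptions, not prescribed values.

```go
// Hypothetical sketch: registering a batch of BareMetalHost resources in
// parallel as a Day 1 provisioning exercise. All names and addresses are
// illustrative only.
package main

import (
	"context"
	"fmt"
	"sync"

	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	scheme := runtime.NewScheme()
	_ = metal3v1alpha1.AddToScheme(scheme)

	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{Scheme: scheme})
	if err != nil {
		panic(err)
	}

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // R1 below targets 100 hosts in parallel
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			host := &metal3v1alpha1.BareMetalHost{
				ObjectMeta: metav1.ObjectMeta{
					Name:      fmt.Sprintf("worker-%03d", i), // hypothetical name
					Namespace: "openshift-machine-api",       // assumed namespace
				},
				Spec: metal3v1alpha1.BareMetalHostSpec{
					Online: true,
					BMC: metal3v1alpha1.BMCDetails{
						// Hypothetical Redfish virtual media endpoint.
						Address:         fmt.Sprintf("redfish-virtualmedia://10.0.0.%d/redfish/v1/Systems/1", i+1),
						CredentialsName: fmt.Sprintf("worker-%03d-bmc-secret", i),
					},
				},
			}
			if err := c.Create(context.TODO(), host); err != nil {
				fmt.Printf("create %s failed: %v\n", host.Name, err)
			}
		}(i)
	}
	wg.Wait()
}
```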
Requirements
This feature requires the engineering team to confirm the feasibility of the 3,500-node target and define the specific metrics that demonstrate the new validated scale.
Functional Requirements (TBD by Engineering)
The engineering team must define and meet specific Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for key BMO operations at the new validated scale limit.
- R1: Day 1 Provisioning Performance: The BMO must successfully provision at least 100 new bare metal hosts in parallel and integrate them into the OpenShift Container Platform within a clearly defined time window.
- R2: Day 2 Management Performance (Firmware Upgrade): The BMO must successfully initiate and complete a full cycle of firmware upgrades and configuration changes across 200 managed hosts within a defined time frame.
- R3: Observability Performance: The BMO must maintain accurate, near real-time status and health reporting for all managed nodes, without degrading OpenShift API server performance or significantly delaying status updates (correlates with OCPSTRAT-2645; a minimal probe sketch follows this list).
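As a starting point for the R3 discussion, a minimal probe sketch, assuming a plain client-side List against the BareMetalHost API: it tallies provisioning states and uses the List duration as a rough status-reporting SLI. The namespace and the decision to bypass an informer cache are assumptions for illustration.

```go
// Hypothetical R3 probe: list every BareMetalHost, tally provisioning
// states, and report how long the listing took at the current scale.
package main

import (
	"context"
	"fmt"
	"time"

	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	scheme := runtime.NewScheme()
	_ = metal3v1alpha1.AddToScheme(scheme)

	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{Scheme: scheme})
	if err != nil {
		panic(err)
	}

	start := time.Now()
	var hosts metal3v1alpha1.BareMetalHostList
	if err := c.List(context.TODO(), &hosts, client.InNamespace("openshift-machine-api")); err != nil {
		panic(err)
	}

	// Tally hosts by provisioning state (e.g., available, provisioned, error).
	states := map[metal3v1alpha1.ProvisioningState]int{}
	for _, h := range hosts.Items {
		states[h.Status.Provisioning.State]++
	}

	fmt.Printf("listed %d hosts in %s\n", len(hosts.Items), time.Since(start))
	for state, count := range states {
		fmt.Printf("  %s: %d\n", state, count)
	}
}
```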
Non-Functional Requirements
- Scalability: The BMO architecture must be tuned to efficiently manage controller reconciliation loops, network I/O, and API load so that it reliably supports the new validated node limit (up to 3,500 nodes); a concurrency-tuning sketch follows this list.
- Performance: All critical BMO operations must scale linearly or near-linearly, avoiding performance cliffs as the managed node count approaches the new limit.
- Reliability/Error Rate: The failure rate for large-scale operations (e.g., provisioning or bulk configuration changes) must be minimal, with robust error handling and effective retry mechanisms to ensure a high success percentage (e.g., a 99.5% success rate for provisioning at the 3,500-node scale).
- Documentation: New limits must be clearly defined in the documentation, including all necessary assumptions (e.g., separate limits for virtual media/Redfish and network booting).
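The scalability requirement above points at reconciliation concurrency. The sketch below shows the generic controller-runtime knob (MaxConcurrentReconciles) that such tuning typically involves; the stand-in reconciler and the worker count of 20 are illustrative assumptions, not the BMO's actual wiring or a recommended value.

```go
// Minimal sketch of raising reconcile concurrency so host reconciliation
// is not serialized behind a single worker. Illustrative only.
package main

import (
	"context"

	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// hostReconciler is a stand-in for the BMO's BareMetalHost reconciler.
type hostReconciler struct{}

func (r *hostReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Real reconciliation (power management, inspection, provisioning)
	// would happen here.
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	_ = metal3v1alpha1.AddToScheme(mgr.GetScheme())

	// More concurrent workers spread API and BMC load across hosts
	// instead of queueing them behind one reconcile loop.
	err = ctrl.NewControllerManagedBy(mgr).
		For(&metal3v1alpha1.BareMetalHost{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 20}). // illustrative value
		Complete(&hostReconciler{})
	if err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```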
Use Case
As a Cluster Administrator, I want the Bare Metal Operator to reliably manage up to 3,500 BMaaS nodes in a single cluster so that I can utilize OpenShift Container Platform's management plane to operate my organization's largest bare metal data center footprint efficiently and cost-effectively.
Questions to Answer (Refinement/Architectural)
The following high-level architectural and planning questions must be answered by the engineering team to define the final scope for OpenShift 4.22:
- What specific architectural bottlenecks have been identified that prevent scaling beyond the current limits (500–1,500 nodes)?
- What is the maximum validated node limit achievable in the OpenShift 4.22 release for both virtual media/Redfish and network booting methods?
- What are the proposed SLIs/SLOs for R1, R2, and R3 (e.g., specific timeframes in hours for provisioning 100 nodes)?
- What is the estimated memory and CPU footprint of the BMO controller plane at the new maximum scale?
Out of Scope
The following items are explicitly not included in the scope of this feature:
- Adding new hardware vendor support or integrating new hardware-specific provisioning interfaces.
Links
- Bare Metal as a Service (BMaaS) GA Feature: https://issues.redhat.com/browse/OCPSTRAT-2405
- Observability at Scale Epic (Correlated): https://issues.redhat.com/browse/OCPSTRAT-2645
- Existing Documentation (BMaaS): https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html-single/installing_on_bare_metal/index#bare-metal-using-bare-metal-as-a-service (Documentation for 4.22 will need updating to reflect new validated limits).
- Super-scaling BMaaS: https://docs.google.com/document/d/1AkzRmm26Ds2Vg1OmWttHekDgbQSXH0xnWEzHoZt7XfE/edit?tab=t.0#heading=h.kniulhd3mgou
Issue Links
- clones: OCPSTRAT-2378 [TP] Enable IPE (Ironic Prometheus Exporter) on OCP (status: Refinement)
- is blocked by: OCPSTRAT-2405 [GA] BareMetal as a Service Support for OpenShift (status: New)