Feature
Resolution: Unresolved
Major
Feature Overview
This feature introduces Hardware Health State Monitoring for OpenShift Bare Metal nodes, providing operators with immediate, high-level visibility into the physical health of their cluster's underlying infrastructure. By integrating with the Redfish standard through the existing Ironic and Metal3 components, the cluster can now report a simplified health status on the Bare Metal Host (BMH) resource, enabling faster diagnosis and response to hardware issues.
Goals
The primary goal is to provide a reliable and consumable hardware health status within the OpenShift Container Platform (OCP) management plane.
- Observable Functionality: Expose a new, consolidated health status field on the BareMetalHost custom resource (CR) for all managed Bare Metal nodes.
- Primary User: Operator.
- Benefit: The operator can swiftly identify nodes experiencing hardware degradation (e.g., disk failure, memory error) and attend to these failures appropriately and in a timely manner.
- Existing Feature Expansion: This feature expands the status reporting capabilities of the BareMetalHost object, specifically within the Bare Metal Operator.
- Scalability: This feature must be aligned with the scaling strategy for the Bare Metal Operator.
Requirements
Functional Requirements
- Redfish Integration: The underlying bare metal management layer (Ironic/Metal3) must successfully retrieve hardware health data from the Baseboard Management Controller (BMC) using the Redfish protocol.
- Status Aggregation: The hardware health data (which includes fields such as Health and HealthRollup) must be ingested and aggregated into a high-level status that downstream tooling can consume (e.g., it can be exported to Prometheus).
- BMH Status Exposure: The aggregated health status must be exposed as a new field (e.g., status.health) on the Bare Metal Host (BMH) custom resource.
- CLI Visibility: The oc get bmh command-line output must be updated to display the new health status field, allowing operators to quickly assess the cluster state.
- Status State Mapping: Define a clear mapping from the upstream Redfish status fields (e.g., the Health value surfaced by Ironic) to the health states exposed on the BMH.
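The aggregation and mapping requirements above could be sketched roughly as follows. This is an illustration only, not the Bare Metal Operator implementation: the HostHealth names, the "worst value wins" rule, and the function names are assumptions made for this example. The only part taken from the Redfish standard is the set of Health/HealthRollup values ("OK", "Warning", "Critical").

```python
from enum import IntEnum
from typing import Optional


class HostHealth(IntEnum):
    """Hypothetical consolidated states, ordered from least to most severe."""
    UNKNOWN = 0
    OK = 1
    WARNING = 2
    CRITICAL = 3


# Redfish Health and HealthRollup use the values "OK", "Warning", "Critical".
_REDFISH_TO_HOST = {
    "OK": HostHealth.OK,
    "Warning": HostHealth.WARNING,
    "Critical": HostHealth.CRITICAL,
}


def map_value(value: Optional[str]) -> HostHealth:
    """Map one Redfish health string; missing or unrecognized values become UNKNOWN."""
    return _REDFISH_TO_HOST.get(value or "", HostHealth.UNKNOWN)


def aggregate(status: dict) -> HostHealth:
    """Assumed rule: report the more severe of Health and HealthRollup."""
    return max(map_value(status.get("Health")),
               map_value(status.get("HealthRollup")))


if __name__ == "__main__":
    sample = {"Health": "OK", "HealthRollup": "Warning"}
    print(aggregate(sample).name)
```

Because UNKNOWN sorts lowest in this sketch, a host with one valid field and one missing field reports the valid one; whether a missing value should instead be treated as a problem is a design decision this example does not settle.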
Nonfunctional Requirements
- Reliability: The health status must be consistently updated based on a defined polling interval, and status retrieval failures must be logged and handled gracefully without impacting the provisioned node's operational state.
- Performance: The polling mechanism for retrieving health status from the BMCs must be resource-efficient and not introduce significant latency or load on the OpenShift Control Plane or the BMC network.
- Scalability: This feature must be aligned with the scaling strategy for the Bare Metal Operator.
- Security: Communication with the BMC (Redfish) must use secure protocols (HTTPS) and utilize the stored, encrypted credentials for authentication.
- Maintainability: The implementation should align with existing Metal3-io standards and be designed for easy maintenance and future expansion of detailed component health metrics.
- Documentation: Comprehensive documentation for the new health status field, its possible values, and operator guidance on responding to WARN/ERROR states must be created and published for the OCP release.
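One way to read the reliability requirement is that a failed retrieval should be logged and should leave the last known status in place, marked as stale, rather than raising an error that affects the host. The sketch below illustrates that behavior under stated assumptions: HealthRecord, poll_once, and the injected fetch callable are hypothetical names, and the actual Redfish call (made over HTTPS with stored credentials in a real system) is deliberately left out.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

log = logging.getLogger("bmh-health")


@dataclass
class HealthRecord:
    """Hypothetical per-host record kept between polls."""
    health: Optional[str] = None      # last successfully retrieved value
    stale: bool = False               # True when the most recent poll failed
    consecutive_failures: int = 0     # reset to 0 on every success


def poll_once(host: str,
              fetch: Callable[[str], str],
              record: HealthRecord) -> HealthRecord:
    """Run one poll; on failure, log it and keep the previous health value."""
    try:
        value = fetch(host)
    except Exception as exc:  # broad on purpose: a poll must never propagate errors
        log.warning("health retrieval failed for %s: %s", host, exc)
        return HealthRecord(record.health, True,
                            record.consecutive_failures + 1)
    return HealthRecord(value, False, 0)
```

A caller would invoke poll_once once per host on the defined polling interval. The consecutive_failures counter is included so that a back-off policy or an alert threshold could be layered on top without changing the record shape.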
Use Case
A typical user scenario for this feature is as follows:
As an operator, I want to have visibility of hardware health state so that I can attend to failures appropriately and in a timely manner.
Questions to Answer (Optional)
- Detailed Health UX: What is the final User Experience (UX) and layout for exposing the health status in the OCP management tools (e.g., Console, CLI)? Specifically, should we adopt an additional column in the oc get bmh output or rely solely on the JSON/YAML status field?
- Future Granularity: The current scope is high-level health. Should the design include placeholder structures or API fields to facilitate the exposure of more detailed component health status (e.g., fan, power supply, disk array) in a subsequent release?
Out of Scope
Links
- Hardware Monitoring feature proposal
- Existing OCP Documentation (Bare Metal):
- Installation on Bare Metal: https://docs.redhat.com/en/documentation/openshift_container_platform/4.20/html-single/installing_on_bare_metal/index
- clones: OCPSTRAT-2644 [GA] Bare Metal Operator Support for RHEL 10 (New)