
Type: Feature
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:

Activity Type:
Product / Portfolio Work
Parent Link:
None
Blocked:
False
Blocked Reason:

None
Ready:
False
Size:
None

Target Version:

openshift-4.21
Release Blocker:
None
Release Type:
Tech Preview

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
None
PX Priority Data:
None
PX Impact Score:
PX Technical Impact:
None
PX Impact Range:
None
PX Scheduling Request:
None
PX Technical Impact Notes:
None

Intelligence Requested:
Market:

Feature Overview

This feature integrates the Ironic Prometheus Exporter (IPE) into OpenShift Container Platform (OCP) on bare metal for both Day-1 and Day-2 operations. It provides a unified observability experience by collecting critical hardware health and performance metrics from bare metal nodes and federating them to a central monitoring location. Administrators get a single pane of glass for both hardware and CaaS metrics, which enables proactive maintenance, better resource management, and improved reliability for bare metal clusters at scale.

Goals

The primary goal is to provide cluster administrators with the ability to monitor the hardware health of their bare metal nodes from a centralized Hub cluster.

Persona: Cluster Administrator, Infrastructure Engineer.
Functionality:
- Natively collect hardware metrics from bare metal nodes.
- Integrate hardware metrics into the standard OpenShift Monitoring stack.
- Federate metrics from all managed clusters to a Red Hat Advanced Cluster Management for Kubernetes (RHACM) Hub cluster, making them available in the central observability dashboard.

Extends: This feature expands the monitoring and management capabilities of the bare metal solution provided by Red Hat.

Requirements

Functional Requirements:

- The system must enable the Ironic Prometheus Exporter to collect metrics from the Baseboard Management Controllers (BMCs) of bare metal nodes.
- The solution must support any Redfish-compliant hardware, with Dell and HPE servers as the priority-one targets for validation and documentation.
- As part of the implementation, an analysis must be performed to identify the full range of metrics available for collection from Dell and HPE hardware (priority one). Where possible, all identified metrics must be collected by default.
- Metrics collected from managed clusters must be exposed to, and queryable from, the RHACM Hub cluster's observability service.
- The system must provide a default, recommended configuration for metric collection that works out of the box.
- A mechanism must be provided for users to customize the metric collection configuration (e.g., modify scrape targets, change intervals, or filter metrics).
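
To make the customization requirement more concrete, the following is a minimal sketch of what "change intervals" and "filter metrics" could mean in practice. It is not the IPE or OpenShift Monitoring API: the `CollectionConfig` class, the 60-second default, and the `example_hw_*` metric names are all hypothetical and exist only for illustration. The sketch filters samples in the Prometheus text exposition format against a user-supplied allow-list of name patterns.

```python
from __future__ import annotations

import fnmatch
from dataclasses import dataclass, field


@dataclass
class CollectionConfig:
    """Hypothetical user-facing settings; not an actual OCP or IPE API."""

    # Assumed placeholder value; the real default is still to be defined.
    scrape_interval_seconds: int = 60
    # Shell-style patterns; the default keeps every metric.
    allowed_metric_patterns: list[str] = field(default_factory=lambda: ["*"])


def metric_name(sample_line: str) -> str:
    """Return the metric name from one Prometheus text-format sample line."""
    # The name ends at the first '{' (start of labels) or space (before the value).
    for index, char in enumerate(sample_line):
        if char in "{ ":
            return sample_line[:index]
    return sample_line


def filter_samples(exposition: str, config: CollectionConfig) -> list[str]:
    """Keep only sample lines whose metric name matches an allowed pattern."""
    kept = []
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        name = metric_name(line)
        if any(fnmatch.fnmatchcase(name, pattern)
               for pattern in config.allowed_metric_patterns):
            kept.append(line)
    return kept
```

With a configuration such as `CollectionConfig(allowed_metric_patterns=["example_hw_temperature_*"])`, only temperature samples would pass through, while the default configuration keeps everything. Whether filtering belongs in the exporter, in the scrape configuration, or in the federation layer is one of the open design questions listed later in this document.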

Non-Functional Requirements:

- Performance: The feature must be designed to minimize the performance impact of metric scraping. The acceptable performance overhead and the default scrape intervals must be defined and documented.
- Scalability: The solution architecture must support collecting and federating metrics efficiently from hundreds or thousands of bare metal nodes across multiple managed clusters to a single Hub (a single RHACM Hub can currently manage up to 3,500 single-node OpenShift (SNO) managed clusters). See more in https://docs.google.com/document/d/1ZTddj33EnEZcKo2F5wo5rD9LY8WgASDGZPn_FTtrHLI/edit?tab=t.0#heading=h.iuspr4gyu8yr
- Usability: Hardware metrics should be seamlessly accessible within the existing RHACM observability UI, alongside standard cluster and application metrics.
- Maintainability: The implementation should be extensible so that additional hardware vendors and new metrics can be supported easily in the future.
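
The performance and scalability requirements interact: the scrape interval and the number of metrics per node together determine how many samples reach the Hub. The back-of-the-envelope sketch below estimates that rate. Only the figure of 3,500 SNO clusters comes from this document; the 200 series per node and the 300-second interval are assumed placeholder values, and `hub_ingest_rate` is a hypothetical helper, not part of any product.

```python
def hub_ingest_rate(clusters: int, nodes_per_cluster: int,
                    series_per_node: int, interval_seconds: float) -> float:
    """Estimate hardware-metric samples per second arriving at the Hub."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    total_series = clusters * nodes_per_cluster * series_per_node
    return total_series / interval_seconds


# 3,500 SNO clusters (one node each); series count and interval are assumptions.
rate = hub_ingest_rate(clusters=3500, nodes_per_cluster=1,
                       series_per_node=200, interval_seconds=300)
print(f"{rate:.0f} samples/s")
```

Under these assumptions the Hub would receive roughly 2,333 samples per second from hardware metrics alone; shortening the interval to 60 seconds multiplies that by five. The real inputs should come from the metric analysis and the documented default interval.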

Use Case

User Story: "As a cluster administrator, I want to see hardware metrics together with CaaS metrics at the Hub cluster level, so that I can use them as part of the observability stack."

Questions to Answer

The following architectural and design questions must be answered before implementation can begin:

- Configuration: How will a user enable this feature?
- Performance Baseline: What is the default metric scrape interval, and what is the expected performance overhead? Which tuning parameters will be available to the user to adjust them?
- Architecture: What is the detailed data path for metrics from the IPE to the RHACM Hub's observability backend?
- Customization: What is the precise mechanism for users to customize the metric collection configuration?
- Metric Analysis: What is the complete list of metrics that will be collected from Dell and HPE servers (priority one) as a result of the initial analysis?