Feature
Resolution: Unresolved
Major
Feature Overview
This feature introduces the capability to capture, log, and correlate Redfish alert events from provisioned servers to their respective Bare Metal Host (BMH) resources within an OpenShift Container Platform (OCP) cluster. By logging these hardware events to stdout (standard output) in the relevant container, users will gain a comprehensive observability story for server hardware, complementing existing metrics and alerts.
Goals
- Primary Goal: Enable the logging of Redfish events from servers associated with a Bare Metal Host (BMH) resource.
- Anticipated Primary User: Cluster Administrators, Site Reliability Engineers (SREs), and Operations teams managing OCP deployments on Bare Metal.
- Expanded Functionality: Extends the observability features of the Bare Metal Operator (BMO) to include hardware-level event logs, integrating this data into the OCP logging infrastructure via the Cluster Logging Operator (CLO).
Requirements
Functional Requirements
- The Bare Metal Host provisioning component must register for Redfish alert events from the associated servers.
- The system must capture received Redfish events and log them to stdout (standard output) in the relevant container.
- The generated logs must contain metadata that correlates each event to the specific Bare Metal Host (BMH) / server that generated it. This correlation must use the Kubernetes UID (the UUID in metadata.uid) of the Bare Metal Host.
- The system must record Redfish events for all hosts with a BMH, including those that are currently unused.
- The raw events must be written to the container log as received, without filtering or transformation of the event payload.
- A log configuration file must be provided to set retention parameters based on log file size and/or time.
- The events must be in a format that can be picked up and forwarded by the Cluster Logging Operator (CLO) to a configured destination.
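The requirements above can be illustrated with a minimal sketch. It is not the metal3 implementation; the function names, the log field names ("bmh", "event", "source"), and the choice to carry the BMH UID in the subscription Context are assumptions made for illustration. The sketch relies on the Redfish behavior that the Context supplied at subscription time is returned with each event, and it emits one JSON object per line so that a log collector can parse each event independently.

```python
import json
import sys
from datetime import datetime, timezone


def build_subscription_body(destination_url, bmh_uid):
    """Build a body for POST /redfish/v1/EventService/Subscriptions.

    Assumption: the BMH UID is carried in Context so that it is echoed back
    with every event. Alert filtering properties differ between BMC
    implementations and are deliberately left out of this sketch.
    """
    return {
        "Destination": destination_url,
        "Protocol": "Redfish",
        "EventFormatType": "Event",
        "Context": bmh_uid,
    }


def format_event_log_line(raw_event, bmh_uid, bmh_name, bmh_namespace):
    """Wrap an unmodified Redfish event with BMH correlation metadata."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": "redfish-event",  # hypothetical marker for log filtering
        "bmh": {"uid": bmh_uid, "name": bmh_name, "namespace": bmh_namespace},
        "event": raw_event,  # raw payload, kept as received
    }
    # Compact separators keep the record on a single line.
    return json.dumps(record, separators=(",", ":"))


def log_event(raw_event, bmh_uid, bmh_name, bmh_namespace, stream=sys.stdout):
    """Write one event as one line to stdout (or another stream)."""
    line = format_event_log_line(raw_event, bmh_uid, bmh_name, bmh_namespace)
    stream.write(line + "\n")
    stream.flush()
```

Keeping the raw event nested under a single key leaves the payload untouched while still letting a forwarder select records by the BMH UID.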
Non-Functional Requirements
- Reliability: The logging mechanism must be robust and should not fail or block the provisioning process if a Redfish event cannot be processed or logged.
- Performance: The event capturing and logging process must have a minimal impact on the performance and resource utilization of the Bare Metal Operator components.
- Maintainability: The implementation must be part of the metal3 component and follow established OpenShift Container Platform (OCP) standards for configuration and logging.
- Security: Redfish communication channels must be secured (e.g., using TLS). Access to Redfish endpoints should follow the principle of least privilege.
- Supported Hardware: Only hardware that exposes a Redfish interface is supported.
- The feature must be verified to work correctly on the following hardware models:
  - Priority 1: Dell XR8620t, Dell XR8720t.
  - Priority 2: HPE DL110 Gen11, HPE DL110 Gen12.
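One possible way to meet the reliability requirement (event logging must never block or fail provisioning) is to hand log lines to a bounded queue that a background thread drains. The sketch below is an assumption about how this could be structured, not a description of the actual component; the class name NonBlockingEventLogger and the max_pending parameter are hypothetical.

```python
import queue
import sys
import threading


class NonBlockingEventLogger:
    """Writes log lines from a background thread so callers never block."""

    def __init__(self, stream=sys.stdout, max_pending=1000):
        self._stream = stream
        self._pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0  # lines discarded because the queue was full
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, line):
        """Queue a line without blocking; return False if it was dropped."""
        try:
            self._pending.put_nowait(line)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self):
        while True:
            line = self._pending.get()
            try:
                if line is None:  # sentinel from close()
                    return
                self._stream.write(line + "\n")
                self._stream.flush()
            except Exception:
                # A failed write must not propagate to the caller.
                pass
            finally:
                self._pending.task_done()

    def close(self):
        """Flush the lines that are still queued, then stop the worker."""
        self._pending.put(None)
        self._worker.join()
```

The trade-off in this design is that events are dropped under sustained overload rather than delaying the caller; the dropped counter gives a simple way to surface that condition.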
Use Case (Scenario)
- As a Cluster Administrator, I want to see hardware-level events (like CPU temperature warnings or power supply failures) in my central log feed, correlated to the specific Bare Metal Host (BMH), so that I can preemptively diagnose and address hardware issues before they impact workload scheduling or cluster stability.
- As an Operations Engineer, I want the installer component to log all Redfish events from servers with a BMH object, even if they are currently idle, so that I can verify the health of the entire inventory pool before a new workload attempts to provision a faulty machine.
Out of Scope
- Active alerting or notification based on the logged events (this is handled by separate monitoring/alerting systems).
- Automatic remediation or corrective action based on the logged event severity (e.g., automatically cordoning a node).
- Changes to the Bare Metal Host provisioning lifecycle or state transitions.
- Configuration of the Cluster Logging Operator (CLO) to forward these logs (the CLO is assumed to be the consuming component).
- Non-Redfish hardware.
Links
- Reference RFE: The scope is based on RFE-8428: Log Redfish events from servers.
- Clones: OCPSTRAT-2686 [DP] Extend Metal3 Firmware Updates (Disk, RAID controllers, CPLD, and TPM) (status: New)