[OBSDA-939] OpenShift added value metrics - Red Hat Issue Tracker

Type: Feature
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: None
Component/s: PM Tracing
Labels:
- CloudProviderIntegration
- ObservabilityVendorIntegration

Blocked: False
Blocked Reason: None
Ready: False
Color Status: Not Selected
PM Score: 0
Parent Link: OBSDA-914 Red Hat Observability Integrations
Hierarchy Progress Bar: 100% To Do, 0% In Progress, 0% Done

Background

OpenShift provides comprehensive metrics across its core platform, and OpenShift Insights adds further capabilities by offering proactive monitoring, diagnostics, and predictive maintenance. The purpose of this exploratory feature is to discover which of these OpenShift metrics, across both core platform and layered services, would add significant value when exported via OpenTelemetry collectors. The final goal is to configure the Red Hat build of OpenTelemetry to export these metrics, enabling third-party observability vendors to showcase them in dedicated OpenShift dashboards.

High-Level Goals

Identify high-value metrics from OpenShift core platform, as well as layered components such as Insights.
Provide configurations for OpenTelemetry collectors to retrieve and export these metrics.
Ensure that these metrics can be used by third-party observability vendors to build dashboards that highlight key OpenShift metrics.
Propose a set of metrics that offer meaningful insights for platform administrators and developers, facilitating platform troubleshooting and capacity planning.

Requirements

Discovery of Key Metrics

Investigate and identify essential metrics from the OpenShift Platform:
- Cluster health: Node status, control plane performance, etc.
- Resource utilization: CPU, memory, storage, and network usage across nodes and pods.
- Scaling and auto-scaling: Metrics related to horizontal and vertical scaling events.
- Pod lifecycle events: Metrics around pod scheduling, readiness, failures, and restarts.
Additionally, identify high-value metrics from OpenShift Insights, which may include:
- Anomalies and diagnostics: Metrics related to detected issues, risks, or misconfigurations in the cluster.
- Failure predictions: Metrics derived from predictive models identifying potential upcoming failures or performance degradation.
- Health checks and recommendations: Metrics indicating overall cluster health and suggested actions for improvement.
Focus on metrics that provide clear, actionable insights into platform stability, performance, and resilience.
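
As a starting point for the discovery work, the categories above could be captured in a small catalogue that is reviewed and pruned over time. The sketch below is illustrative only: the metric names are commonly seen in Kubernetes and OpenShift clusters, but each one (in particular the Insights metric) is an assumption that should be confirmed against the target cluster, and the class and function names are hypothetical. It also shows how a shortlist could be turned into Prometheus series selectors for later use in a scrape configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateMetric:
    name: str      # Prometheus metric name as exposed in the cluster (to be verified)
    category: str  # discovery category from the requirements above
    source: str    # component assumed to expose the metric


# Illustrative shortlist only; every entry is an assumption to validate.
CANDIDATES = [
    CandidateMetric("kube_node_status_condition", "cluster-health", "kube-state-metrics"),
    CandidateMetric("apiserver_request_duration_seconds_bucket", "cluster-health", "kube-apiserver"),
    CandidateMetric("container_cpu_usage_seconds_total", "resource-utilization", "kubelet/cAdvisor"),
    CandidateMetric("container_memory_working_set_bytes", "resource-utilization", "kubelet/cAdvisor"),
    CandidateMetric("kube_horizontalpodautoscaler_status_current_replicas", "scaling", "kube-state-metrics"),
    CandidateMetric("kube_pod_container_status_restarts_total", "pod-lifecycle", "kube-state-metrics"),
    CandidateMetric("health_statuses_insights", "insights", "insights-operator"),
]


def by_category(metrics):
    """Group metric names by discovery category, preserving input order."""
    grouped = {}
    for metric in metrics:
        grouped.setdefault(metric.category, []).append(metric.name)
    return grouped


def match_selectors(metrics):
    """Return one Prometheus series selector per metric name."""
    return ['{__name__="%s"}' % metric.name for metric in metrics]
```

Calling `by_category(CANDIDATES)` gives a per-category view for review, and `match_selectors(CANDIDATES)` yields strings of the form `{__name__="..."}` that a federation scrape could pass as `match[]` parameters.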

Configuration of OpenTelemetry Collectors

Define configurations for OpenTelemetry collectors to retrieve and export these identified metrics.
- Ensure that collectors are set up to handle both platform-level and application-level metrics.
- Enable relevant receivers and exporters in OpenTelemetry that can forward these metrics to observability vendors.
- Review which OpenShift-specific components are compatible with OpenTelemetry’s current receivers and plan any necessary extensions.
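
One possible shape for such a configuration is a Prometheus receiver that scrapes the federation endpoint of the in-cluster monitoring stack, followed by an OTLP/HTTP exporter pointed at a vendor. The sketch below builds that configuration as a Python dictionary and can serialize it as JSON, which is also valid YAML for the collector. It is a minimal sketch under stated assumptions, not a validated deployment: the function name, the vendor endpoint, the scrape interval, and the use of the service account token and `service-ca.crt` paths are all assumptions, and the collector's service account would additionally need permission to read cluster monitoring data.

```python
import json

# Assumed mount path of the collector pod's service account credentials.
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"

# Assumed in-cluster address of the OpenShift monitoring Prometheus service.
FEDERATE_TARGET = "prometheus-k8s.openshift-monitoring.svc.cluster.local:9091"


def build_collector_config(selectors, vendor_endpoint):
    """Build a collector config: federate scrape -> batch -> OTLP/HTTP export."""
    scrape_job = {
        "job_name": "federate",
        "scrape_interval": "30s",  # assumption; tune to vendor ingest limits
        "scheme": "https",
        "metrics_path": "/federate",
        "honor_labels": True,
        "params": {"match[]": list(selectors)},
        "bearer_token_file": f"{SA_DIR}/token",
        "tls_config": {"ca_file": f"{SA_DIR}/service-ca.crt"},
        "static_configs": [{"targets": [FEDERATE_TARGET]}],
    }
    return {
        "receivers": {"prometheus": {"config": {"scrape_configs": [scrape_job]}}},
        "processors": {"batch": {}},
        "exporters": {"otlphttp": {"endpoint": vendor_endpoint}},
        "service": {
            "pipelines": {
                "metrics": {
                    "receivers": ["prometheus"],
                    "processors": ["batch"],
                    "exporters": ["otlphttp"],
                }
            }
        },
    }


def to_json(config):
    """Serialize the config; JSON output can be loaded wherever YAML is expected."""
    return json.dumps(config, indent=2)
```

For example, `to_json(build_collector_config(['{__name__="kube_node_status_condition"}'], "https://otlp.vendor.example:4318"))` produces a document that could be placed in the `config` field of an `OpenTelemetryCollector` resource after review. Restricting `match[]` to the shortlisted metrics keeps the exported volume bounded, which matters when the vendor charges per ingested series.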

Documentation

Provide detailed documentation on how to configure OpenTelemetry collectors to retrieve and export the discovered metrics.
- Ensure documentation includes links to relevant OpenTelemetry and OpenShift docs for further setup guidance.
- Add descriptions for each exported metric, detailing what it represents and how it can be used by observability vendors to build effective dashboards.
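
To keep per-metric descriptions consistent, each exported metric could be documented from a single structured record. The sketch below is one hypothetical layout: the field names and the rendering format are assumptions, and the example entry paraphrases the commonly documented meaning of a kube-state-metrics counter rather than quoting official text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDoc:
    name: str           # exported metric name
    metric_type: str    # counter, gauge, histogram, ...
    description: str    # what the metric represents
    dashboard_use: str  # how a vendor dashboard might use it


def render(doc):
    """Render one metric description as a plain-text documentation entry."""
    return "\n".join(
        [
            doc.name,
            f"  Type: {doc.metric_type}",
            f"  Description: {doc.description}",
            f"  Dashboard use: {doc.dashboard_use}",
        ]
    )


# Hypothetical example entry; wording should be checked against upstream docs.
EXAMPLE = MetricDoc(
    name="kube_pod_container_status_restarts_total",
    metric_type="counter",
    description="Number of restarts per container.",
    dashboard_use="Chart the rate of increase per namespace to surface crash loops.",
)
```

Rendering every catalogue entry with `render` would give vendors a uniform reference to build panels from.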

Stretch (not needed to close the feature): Vendor Dashboard Integration

Collaborate with third-party observability vendors to define how these metrics can be showcased in their dashboards.
- Provide vendors with a sample OpenShift dashboard configuration that showcases the essential metrics.
- These dashboards may be expanded for more complex use cases like monitoring AI/ML workloads or multi-cloud deployments.
