Type: Outcome
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None
Blocked: False
Blocked Reason: None
Ready: False
Color Status: Not Selected
PM Score: 0
Hierarchy Progress Bar: 71% To Do, 29% In Progress, 0% Done

Goal: Enable observability for the Red Hat AI platforms by exposing accelerator metrics, integrating with OpenShift Lightspeed, defining OpenTelemetry semantic conventions for GenAI workloads, and ensuring seamless monitoring of AI workloads with Red Hat partners (Dynatrace).

Key Deliverables:

Accelerator Metrics:

Implement visibility into GPU, TPU, and other accelerator performance metrics, including utilization, memory, power consumption, and efficiency.
Standardize metrics collection across various AI accelerators to ensure consistency in monitoring.
Expose these metrics within the OpenShift Observe UI and for internal use within TeleSense.
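
To illustrate what standardized collection could look like, here is a minimal Python sketch that maps vendor-specific readings onto one common record. The DCGM field names are the ones NVIDIA's DCGM exposes; the AcceleratorSample type, the "example-tpu" source with its field names, and the utilization-per-watt efficiency figure are assumptions made for illustration and are not part of this outcome's design.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class AcceleratorSample:
    """Vendor-neutral reading for one accelerator (hypothetical schema)."""
    vendor: str
    device_id: str
    utilization_ratio: float  # 0.0 to 1.0
    memory_used_bytes: int
    power_watts: float

    @property
    def utilization_per_watt(self) -> float:
        # Illustrative efficiency figure; the ticket does not define one.
        if self.power_watts <= 0:
            return 0.0
        return self.utilization_ratio / self.power_watts


def from_dcgm(device_id: str, raw: dict) -> AcceleratorSample:
    """Map NVIDIA DCGM fields (percent, MiB, watts) to the common record."""
    return AcceleratorSample(
        vendor="nvidia",
        device_id=device_id,
        utilization_ratio=raw["DCGM_FI_DEV_GPU_UTIL"] / 100.0,
        memory_used_bytes=int(raw["DCGM_FI_DEV_FB_USED"] * MIB),
        power_watts=float(raw["DCGM_FI_DEV_POWER_USAGE"]),
    )


def from_example_tpu(device_id: str, raw: dict) -> AcceleratorSample:
    """Map a hypothetical TPU exporter; these field names are invented."""
    return AcceleratorSample(
        vendor="example-tpu",
        device_id=device_id,
        utilization_ratio=float(raw["duty_cycle_ratio"]),
        memory_used_bytes=int(raw["hbm_used_bytes"]),
        power_watts=raw["power_milliwatts"] / 1000.0,
    )
```

Once every source is converted to the same units, dashboards and alerts can be written once against the common record rather than once per vendor.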

Integration with OpenShift Lightspeed:

Align the observability architecture with OpenShift Lightspeed to build AI-driven (agentic) automation and operational insights.
Support predictive maintenance and workload optimization by defining focused agents to support Lightspeed workflows.

GenAI Workload Monitoring with OpenTelemetry:

Establish standardized OpenTelemetry (OTel) schemas for Generative AI workloads to ensure consistent tracing, logging, and metrics collection.
Capture key AI workload telemetry such as model execution latency, inference throughput, and token generation efficiency.
Provide documentation and reference implementations for AI developers, for the RHOAI platform, and for integration with third-party vendors.
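
As a rough sketch of the telemetry listed above, the Python below builds span attributes for one inference call and derives two throughput figures from its timings. The attribute keys follow the OpenTelemetry GenAI semantic conventions as published at the time of writing, which are still marked as in development, so they should be checked against the current specification. The function names and the choice of derived figures are assumptions for illustration only.

```python
def genai_span_attributes(operation: str, model: str,
                          input_tokens: int, output_tokens: int) -> dict:
    """Span attributes for one GenAI call, keyed by OTel GenAI attribute names."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


def output_tokens_per_second(duration_s: float, output_tokens: int) -> float:
    """Inference throughput over the whole request."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return output_tokens / duration_s


def time_per_output_token(duration_s: float, time_to_first_token_s: float,
                          output_tokens: int) -> float:
    """Average seconds per token generated after the first one."""
    if output_tokens < 2:
        raise ValueError("need at least two output tokens")
    return (duration_s - time_to_first_token_s) / (output_tokens - 1)
```

For example, a request that takes 2.5 s in total, returns its first token after 0.5 s, and produces 101 output tokens averages (2.5 - 0.5) / 100 = 0.02 s per subsequent token.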

Integration with AI Partners (Dynatrace):

Ensure seamless data flow from OpenTelemetry observability into Dynatrace’s monitoring platform.
Work with partner engineering and Dynatrace to deliver pre-configured dashboards and alerting mechanisms within the Dynatrace UI for AI workload health monitoring.
Plan integrations with other partners, following the same approach as with Dynatrace.

Impact & Value:

Improved AI Platform Reliability: Enhanced monitoring capabilities reduce downtime and optimize resource allocation.
Better AI Workload Performance: Insights into GenAI workloads enable efficient model execution and cost management.
Stronger Ecosystem Integration: OpenShift Lightspeed and Dynatrace interoperability provide a strong partnership focused on AI observability.
Standardization & Extensibility: The OpenTelemetry approach ensures compatibility with various observability tools and AI frameworks.

Assignee: Radek Vokal
Reporter: Radek Vokal
Votes: 0
Watchers: 2
Created: 2025/02/24 12:20 PM
Updated: 2025/02/24 5:36 PM
