-
Outcome
-
Resolution: Unresolved
71% To Do, 29% In Progress, 0% Done
Goal: Enable observability for the Red Hat AI platforms by exposing accelerator metrics, integrating with OpenShift Lightspeed, defining OpenTelemetry semantic conventions for GenAI workloads, and ensuring seamless monitoring of AI workloads with Red Hat partners (Dynatrace).
Key Deliverables:
Accelerator Metrics:
- Implement visibility into GPU, TPU, and other accelerator performance metrics, including utilization, memory, power consumption, and efficiency.
- Standardize metrics collection across various AI accelerators to ensure consistency in monitoring.
- Expose these metrics in the OpenShift Observe UI and make them available for internal use within TeleSense.
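The standardization deliverable above can be sketched as a small normalization layer that maps vendor-specific metric names onto one common record. This is a minimal illustration, not a proposed design: the `nvidia` keys follow the field names commonly exposed by NVIDIA's DCGM exporter, while `vendor_b`, its metric names, the `AcceleratorSample` type, and the utilization-per-watt efficiency proxy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceleratorSample:
    """Vendor-neutral record (hypothetical schema for illustration)."""
    vendor: str
    device: str
    utilization_pct: float
    memory_used_bytes: int
    power_watts: float


# vendor -> {raw metric name: (common field, scale factor to the common unit)}
FIELD_MAP = {
    "nvidia": {
        "DCGM_FI_DEV_GPU_UTIL": ("utilization_pct", 1.0),           # already percent
        "DCGM_FI_DEV_FB_USED": ("memory_used_bytes", 1024 * 1024),  # MiB -> bytes
        "DCGM_FI_DEV_POWER_USAGE": ("power_watts", 1.0),            # already watts
    },
    "vendor_b": {  # hypothetical second vendor with different units
        "accel_busy_ratio": ("utilization_pct", 100.0),             # ratio -> percent
        "accel_mem_used_bytes": ("memory_used_bytes", 1),
        "accel_power_milliwatts": ("power_watts", 0.001),           # mW -> W
    },
}


def normalize(vendor: str, device: str, raw: dict) -> AcceleratorSample:
    """Convert one vendor-specific scrape into the common record.

    Unknown raw metrics are ignored; a missing required metric raises KeyError.
    """
    mapping = FIELD_MAP[vendor]
    fields = {}
    for raw_name, value in raw.items():
        if raw_name in mapping:
            common_name, scale = mapping[raw_name]
            fields[common_name] = value * scale
    return AcceleratorSample(
        vendor=vendor,
        device=device,
        utilization_pct=float(fields["utilization_pct"]),
        memory_used_bytes=int(fields["memory_used_bytes"]),
        power_watts=float(fields["power_watts"]),
    )


def utilization_per_watt(sample: AcceleratorSample) -> float:
    """Crude efficiency proxy (an assumption, not a defined platform metric)."""
    return sample.utilization_pct / sample.power_watts if sample.power_watts > 0 else 0.0
```

With this shape, dashboards and alert rules can be written once against the common fields, and supporting a new accelerator only requires a new entry in the mapping table.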
Integration with OpenShift Lightspeed:
- Align the observability architecture with OpenShift Lightspeed to build AI-driven (agentic) automation and operational insights.
- Support predictive maintenance and workload optimization by defining focused agents to support Lightspeed workflows.
GenAI Workload Monitoring with OpenTelemetry:
- Establish standardized OpenTelemetry (OTel) schemas for Generative AI workloads to ensure consistent tracing, logging, and metrics collection.
- Capture key AI workload telemetry such as model execution latency, inference throughput, and token generation efficiency.
- Provide documentation and reference implementations for AI developers, the RHOAI platform, and integrations with third-party vendors.
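As a rough illustration of the telemetry listed above, the sketch below derives latency, throughput, and per-token timing from one inference call and pairs them with span attributes. The attribute keys follow the OpenTelemetry GenAI semantic conventions as currently published; those conventions are still in development, so the names may change. The `InferenceRecord` type, the helper functions, and the derived metric names are hypothetical, and the code uses only the standard library rather than the OpenTelemetry SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceRecord:
    """Timing and token counts for one model call (hypothetical shape)."""
    model: str
    input_tokens: int
    output_tokens: int
    started_s: float      # request received
    first_token_s: float  # first output token emitted
    finished_s: float     # last output token emitted


def to_attributes(rec: InferenceRecord) -> dict:
    """Span attributes keyed by OTel GenAI semantic-convention names."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": rec.model,
        "gen_ai.usage.input_tokens": rec.input_tokens,
        "gen_ai.usage.output_tokens": rec.output_tokens,
    }


def derived_metrics(rec: InferenceRecord) -> dict:
    """Latency, throughput, and per-token timing for one call."""
    latency = rec.finished_s - rec.started_s
    time_to_first_token = rec.first_token_s - rec.started_s
    decode_time = rec.finished_s - rec.first_token_s
    return {
        "latency_s": latency,
        "time_to_first_token_s": time_to_first_token,
        # End-to-end throughput: output tokens over total wall-clock time.
        "output_tokens_per_s": rec.output_tokens / latency if latency > 0 else 0.0,
        # Average gap between consecutive output tokens after the first.
        "time_per_output_token_s": (
            decode_time / (rec.output_tokens - 1) if rec.output_tokens > 1 else 0.0
        ),
    }
```

In a real integration, the attribute dictionary would be set on a span and the derived values recorded through OpenTelemetry metric instruments; keeping the computation separate from the SDK makes the definitions easy to review and test.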
Integration with AI Partners (Dynatrace):
- Ensure seamless flow of OpenTelemetry data into Dynatrace’s monitoring platform.
- Work with partner engineering and Dynatrace to deliver pre-configured dashboards and alerting mechanisms within the Dynatrace UI for AI workload health monitoring.
- Plan integrations with other partners, following the same approach used with Dynatrace.
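The data flow into Dynatrace could be wired up roughly as follows. This sketch only assembles OTLP/HTTP exporter settings. The URL shape and the `Api-Token` authorization scheme reflect my understanding of Dynatrace SaaS OTLP ingest and should be verified against Dynatrace's documentation; Managed and ActiveGate deployments use different base URLs. The function name and return shape are hypothetical.

```python
def dynatrace_otlp_settings(environment_id: str, api_token: str) -> dict:
    """Build per-signal OTLP/HTTP endpoints and auth headers.

    Assumes a Dynatrace SaaS environment; verify the URL pattern and
    token scopes for the actual deployment before relying on this.
    """
    if not environment_id or not api_token:
        raise ValueError("environment_id and api_token are required")
    base = f"https://{environment_id}.live.dynatrace.com/api/v2/otlp"
    return {
        # OTLP/HTTP uses one path per signal under the base URL.
        "endpoints": {
            signal: f"{base}/v1/{signal}" for signal in ("traces", "metrics", "logs")
        },
        "headers": {"Authorization": f"Api-Token {api_token}"},
    }
```

Keeping the partner-specific details in one small function like this means a second partner integration would mostly need its own settings builder, while the instrumentation and semantic conventions stay unchanged.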
Impact & Value:
- Improved AI Platform Reliability: Enhanced monitoring capabilities reduce downtime and optimize resource allocation.
- Better AI Workload Performance: Insights into GenAI workloads enable efficient model execution and cost management.
- Stronger Ecosystem Integration: Interoperability with OpenShift Lightspeed and Dynatrace strengthens partnerships focused on AI observability.
- Standardization & Extensibility: The OpenTelemetry approach ensures compatibility with various observability tools and AI frameworks.