-
Feature
-
Resolution: Unresolved
Feature Overview (mandatory - Complete while in New status)
Introduce granular per-process GPU power metrics into Power Monitoring (Kepler). This is critical because AI workloads rely on GPUs, which draw substantial power. Measuring this consumption per process lets users pinpoint waste and make data-driven decisions on workload placement to maximize efficiency.
Goals (mandatory - Complete while in New status)
Deliver per-process GPU power observability, functionally similar to existing CPU metrics, to enable optimization of GPU-intensive workloads.
What is the difference between today’s current state and a world with this Feature?
Current State: Kepler currently has no GPU power support, so what is likely the largest power draw remains a blind spot.
Future State: Users can monitor real-time GPU energy usage down to the process level, allowing for informed, data-driven decisions.
Requirements (mandatory - Complete while in Refinement status):
| Requirement | Notes | isMVP? |
| GPU Power Metrics must be gathered at the Process level. | Must mirror CPU Process Metrics functionality (e.g., kepler_process_cpu_watts). | Yes |
| Metrics support for multi-instance GPUs. | ? | |
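To illustrate what process-level GPU power attribution could look like, here is a minimal sketch that splits a device-level power reading across processes in proportion to their GPU utilization and then rolls the result up per pod. This is an assumption for discussion only, not Kepler's implementation: the `ProcessSample` type, the `gpu_util` field, the pod labels, and the proportional-share model are all hypothetical, and idle power is left unattributed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessSample:
    """One process observed on a GPU during a sampling window (hypothetical shape)."""
    pid: int
    pod: str          # hypothetical pod label for roll-up
    gpu_util: float   # assumed share of GPU compute time used by the process, 0..1


def attribute_gpu_power(device_watts: float, samples: list[ProcessSample]) -> dict[int, float]:
    """Split device-level GPU power across processes in proportion to utilization."""
    total_util = sum(s.gpu_util for s in samples)
    if total_util <= 0:
        # No measurable activity: attribute nothing rather than divide by zero.
        return {s.pid: 0.0 for s in samples}
    return {s.pid: device_watts * s.gpu_util / total_util for s in samples}


def aggregate_by_pod(per_process: dict[int, float], samples: list[ProcessSample]) -> dict[str, float]:
    """Sum per-process watts into per-pod watts."""
    pod_watts: dict[str, float] = {}
    for s in samples:
        pod_watts[s.pod] = pod_watts.get(s.pod, 0.0) + per_process[s.pid]
    return pod_watts


samples = [
    ProcessSample(pid=101, pod="training", gpu_util=0.6),
    ProcessSample(pid=102, pod="training", gpu_util=0.2),
    ProcessSample(pid=201, pod="inference", gpu_util=0.2),
]
per_process = attribute_gpu_power(200.0, samples)
per_pod = aggregate_by_pod(per_process, samples)
```

With a 200 W device reading, process 101 is attributed roughly 120 W and the "training" pod roughly 160 W. A proportional split is only one possible model; multi-instance GPUs would likely need attribution per instance rather than per device.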
Done - Acceptance Criteria (mandatory - Complete while in Refinement status):
- GPU energy consumption metrics are successfully collected and exposed by Kepler at the container, pod, and process granularity.
- Users can visualize and track GPU power metrics within the OpenShift Console dashboards.
- The new GPU metrics demonstrate accurate measurement for workloads utilizing multi-instance GPUs.
Out of Scope (Initial completion while in Refinement status):
- GPU AI or performance metrics (focus is purely on power/energy attribution).
- Any UI/dashboard development beyond displaying the new Kepler metric data.