-
Feature
-
Resolution: Unresolved
-
OBSDA-731 Power monitoring GA Release Tracker
-
67% To Do, 0% In Progress, 33% Done
Background
The aim of this feature is to ensure that the metrics Kepler produces on bare metal are accurate, by comparing their values against other tools such as node-exporter and process-exporter.
Motivation
The motivation for this work is varied. Not all hardware provides the same APIs for power monitoring. RAPL is available on nearly all modern Intel processors and on some AMD processors. ACPI also plays a significant role in platform power reporting. In addition, Redfish has shown better results when compared against power meter readings (see OBSDA-645).
Requirements
All Available Metrics
- kepler_<level>_bpf_block_irq_total
- kepler_<level>_bpf_cpu_time_ms_total
- kepler_<level>_bpf_net_rx_irq_total
- kepler_<level>_bpf_net_tx_irq_total
- kepler_<level>_bpf_page_cache_hit_total
- kepler_<level>_cache_miss_total
- kepler_<level>_cpu_cycles_total
- kepler_<level>_cpu_instructions_total
- kepler_<level>_cpu_ref_cycles_total
- kepler_<level>_package_joules_total
- kepler_<level>_platform_joules_total
- kepler_<level>_uncore_joules_total
- kepler_<level>_core_joules_total
- kepler_<level>_dram_joules_total
- kepler_<level>_joules_total
- kepler_<level>_other_joules_total
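The `<level>` placeholder in the names above can be expanded into concrete metric names when building a validation checklist. The following is a minimal sketch, assuming the four levels named elsewhere in this ticket (node, process, vm, container); the helper `expand_metric` and the constant `LEVELS` are hypothetical names used for illustration, and not every expanded name is guaranteed to be exported on every platform.

```python
# Levels taken from this ticket; the helper below is illustrative only.
LEVELS = ("node", "process", "vm", "container")

# Energy metric templates copied from the list above.
ENERGY_TEMPLATES = (
    "kepler_<level>_package_joules_total",
    "kepler_<level>_platform_joules_total",
    "kepler_<level>_uncore_joules_total",
    "kepler_<level>_core_joules_total",
    "kepler_<level>_dram_joules_total",
    "kepler_<level>_joules_total",
    "kepler_<level>_other_joules_total",
)


def expand_metric(template: str, levels=LEVELS) -> list[str]:
    """Replace the <level> placeholder with each level name."""
    return [template.replace("<level>", level) for level in levels]


if __name__ == "__main__":
    for template in ENERGY_TEMPLATES:
        for name in expand_metric(template):
            print(name)
```

Keeping the templates in one tuple makes it easy to compare the expanded names against what a given Kepler deployment actually exports.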
Metrics Validated
- <level>_joules_total
- component metrics for each level (node, process, vm, container): package, core, dram, uncore, other
- platform metrics for each level (node, process, vm, container): acpi, redfish
- kepler_<level>_joules_total : <needs to be well defined>
- kepler_<level>_package_joules_total
- kepler_<level>_platform_joules_total
Validated at a Node Level
- kepler_node_core_joules_total
- kepler_node_uncore_joules_total
- kepler_node_dram_joules_total
- kepler_node_other_joules_total
Metrics not covered in GA
- kepler_<level>_bpf_block_irq_total
- kepler_<level>_bpf_cpu_time_ms_total
- kepler_<level>_bpf_net_rx_irq_total
- kepler_<level>_bpf_net_tx_irq_total
- kepler_<level>_bpf_page_cache_hit_total
- kepler_<level>_cache_miss_total
- kepler_<level>_cpu_cycles_total
- kepler_<level>_cpu_instructions_total
- kepler_<level>_cpu_ref_cycles_total
Upon completion of this ticket, power monitoring users shall:
- Have a list of metrics that are validated against other tools on bare metal
- Have test results that report the MAE (mean absolute error) and MAPE (mean absolute percentage error) between Kepler measurements and the values from the comparison tools
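For reference, MAE and MAPE can be computed from paired samples as sketched below. This is a minimal illustration, not the validation tooling itself: the function names, the choice of the comparison tool as the reference series, and the sample values are all assumptions made for the example.

```python
from typing import Sequence


def _check(reference: Sequence[float], measured: Sequence[float]) -> None:
    if len(reference) != len(measured) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")


def mae(reference: Sequence[float], measured: Sequence[float]) -> float:
    """Mean absolute error, in the same unit as the inputs (e.g. joules)."""
    _check(reference, measured)
    total = sum(abs(m - r) for r, m in zip(reference, measured))
    return total / len(reference)


def mape(reference: Sequence[float], measured: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent of the reference value."""
    _check(reference, measured)
    if any(r == 0 for r in reference):
        raise ValueError("MAPE is undefined when a reference value is zero")
    total = sum(abs((m - r) / r) for r, m in zip(reference, measured))
    return 100.0 * total / len(reference)


if __name__ == "__main__":
    # Hypothetical paired samples: comparison tool vs. Kepler, same intervals.
    reference = [100.0, 200.0, 400.0]
    kepler = [110.0, 190.0, 400.0]
    print(f"MAE:  {mae(reference, kepler):.2f}")
    print(f"MAPE: {mape(reference, kepler):.2f}%")
```

MAE keeps the unit of the metric, while MAPE is relative to the reference and so is easier to compare across metrics of different magnitude; MAPE is undefined when a reference sample is zero, which the sketch rejects explicitly.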
- is cloned by
-
OBSDA-1025 Support GPU power metrics
- New