Feature
Resolution: Unresolved
OBSDA-1033 Power monitoring POST GA Tracker
100% To Do, 0% In Progress, 0% Done
Background
As mentioned in a blog, Kepler claims to collect real-time power consumption metrics from node components through several APIs: Intel Running Average Power Limit (RAPL) for CPU and DRAM power; NVIDIA Management Library (NVML) for GPU power; Advanced Configuration and Power Interface (ACPI) for platform power, i.e., the power of the entire node; and Redfish/Intelligent Platform Management Interface (IPMI), also for platform power. When no real-time power metrics are available on the system, it uses regression-based trained power models instead.
However, in the Work in progress Challenges section, the authors also mention that "Extra Data Import Support: One of the key focuses of the Kepler community is to broaden its horizons by providing extra power data import support, e.g., power source from Board Management Controller (BMC), IPMI support, and RedFish support". This suggests that some of the sources listed above are not yet fully supported.
Problem statement
It is not clear which power meters are being used. This, together with knowing which counters are available on each architecture, matters when trying to understand on which CPU architectures Kepler can provide sound results and can therefore be used.
Work on Kepler upstream has been led by IBM and, to some extent, Intel. Does this work apply to all their CPU architectures and brands, or only to a limited set of Intel processors?
If RAPL is not available, can DRAM power still be measured?
All in all, can Kepler be used on the following list of processors used in the public cloud?
User Story
- As an OpenShift customer, I want to understand whether Kepler metrics coming from workloads running in the public cloud can be trusted and reproduced, so that I can make business decisions.
- For example, if I measure my workload on AWS twice, can I expect the same results when not controlling the type of instance (and processor) being deployed? Can I make business decisions based on these metrics?
- If I run on GCP, will I get the same numbers?
Requirement
If the previous answers depend on hardware, it shall be documented:
- Which APIs is Power Monitoring actually able to use? (RAPL, others?)
- In which public clouds is power monitoring supported?
- What do I need to know about my hardware to trust metrics reported by power monitoring? Is it important to know any of the following?
  - Hardware (CPU, GPU...)
  - Kernel
  - Cloud providers (AWS, GCP, Azure)
  - On-prem supported versions
  - Virtualization of the cluster
- Does Kepler collect data on master and worker nodes?
For all the above cases, if there are differences in how Kepler should be installed and used, such differences shall be documented as well.
Compatibility matrix example from scaphandre: https://hubblo-org.github.io/scaphandre-documentation/compatibility.html
Of course, Kepler also has estimators, which should be added to the "matrix".
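As a starting point for filling in such a matrix, the RAPL and DRAM questions above can be checked per node. The following is a minimal sketch, assuming a Linux node that exposes RAPL through the kernel powercap sysfs interface under /sys/class/powercap. The function names list_rapl_zones and has_dram_zone are hypothetical helpers for illustration; this is not how Kepler itself detects its power sources.

```python
from pathlib import Path


def list_rapl_zones(powercap_root="/sys/class/powercap"):
    """Map each RAPL zone directory (e.g. 'intel-rapl:0') to its domain name.

    Assumption: RAPL is exposed through the Linux powercap sysfs tree.
    An empty result means no RAPL counters are visible to this OS, which
    is common inside virtual machines.
    """
    root = Path(powercap_root)
    zones = {}
    if not root.is_dir():
        return zones
    for zone in sorted(root.glob("intel-rapl:*")):
        name_file = zone / "name"
        energy_file = zone / "energy_uj"
        # Only check that the energy counter exists; reading it may
        # require elevated privileges on recent kernels.
        if name_file.is_file() and energy_file.is_file():
            zones[zone.name] = name_file.read_text().strip()
    return zones


def has_dram_zone(zones):
    """Return True if any visible RAPL zone reports the 'dram' domain."""
    return any(name == "dram" for name in zones.values())


if __name__ == "__main__":
    found = list_rapl_zones()
    if not found:
        print("No RAPL zones visible on this node")
    for zone_dir, domain in found.items():
        print(f"{zone_dir}: {domain}")
    print("DRAM domain available:", has_dram_zone(found))
```

Running this on each instance type of interest (bare metal and virtualized) would give one column of the matrix: whether RAPL is visible at all, and whether a DRAM domain is among the exposed zones.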
Acceptance Criteria
- The documentation must make clear, before power monitoring is installed, how valid the reported numbers will be.
- clones: OBSDA-627 Documentation of Supported Hardware and Platforms for Power monitoring (To Do)