Type: Feature
Resolution: Unresolved
Priority: Normal
Fix Version/s: powermon-1.1
Affects Version/s: None
Component/s: PM Power-monitoring
Labels:
- powermon-ga
- productization

Blocked:
False
Blocked Reason:
None
Ready:
False
Color Status:
Not Selected
PM Score:
0
Parent Link:
OBSDA-1033 Power monitoring POST GA Tracker
Hierarchy Progress Bar:

100% To Do, 0% In Progress, 0% Done

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Intelligence Requested:
Market:

Background

As mentioned in a blog, Kepler claims to collect real-time power consumption metrics from the node components using various APIs: Intel Running Average Power Limit (RAPL) for CPU and DRAM power; NVIDIA Management Library (NVML) for GPU power; Advanced Configuration and Power Interface (ACPI) for platform power, i.e., the power of the entire node; and Redfish or Intelligent Platform Management Interface (IPMI), also for platform power. When no real-time power metrics are available in the system, it relies on regression-based trained power models.

However, in the section on work-in-progress challenges, the authors also state: "Extra Data Import Support: One of the key focuses of the Kepler community is to broaden its horizons by providing extra power data import support, e.g., power source from Board Management Controller (BMC), IPMI support, and RedFish support".

Problem statement

It is not clear which power meters are being used. This, together with knowing which counters are available on each architecture, becomes relevant when trying to understand on which CPU architectures Kepler can provide sound results and can therefore be used.

Work done in Kepler upstream has been led by IBM and, to some extent, Intel. Does this work apply to all their CPU architectures and brands? Or does it only apply to a limited set of Intel processors?

If RAPL is not available, can DRAM power still be measured?
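To make the RAPL question concrete, here is a minimal sketch of how a node could be checked for RAPL domains. It assumes a Linux node where RAPL, if present, is exposed through the powercap sysfs interface under `/sys/class/powercap` as `intel-rapl:*` zones, each with a `name` file. The function names `list_rapl_zones` and `has_dram_zone` are hypothetical helpers for illustration; this is not how Kepler itself performs detection.

```python
from pathlib import Path


def list_rapl_zones(root="/sys/class/powercap"):
    """Map each RAPL zone directory name to the domain name it reports.

    Assumption: zones appear as 'intel-rapl:<n>' (packages) and
    'intel-rapl:<n>:<m>' (subdomains such as core or dram), each
    containing a 'name' file, as in the Linux powercap interface.
    """
    base = Path(root)
    zones = {}
    if not base.is_dir():
        return zones  # powercap not present, so no RAPL zones are visible
    for zone in sorted(base.glob("intel-rapl:*")):
        name_file = zone / "name"
        if not name_file.is_file():
            continue
        try:
            zones[zone.name] = name_file.read_text().strip()
        except OSError:
            # Unreadable zone (for example, restricted permissions); skip it.
            continue
    return zones


def has_dram_zone(zones):
    """Return True if any listed zone reports the 'dram' domain."""
    return any(name == "dram" for name in zones.values())


if __name__ == "__main__":
    found = list_rapl_zones()
    if not found:
        print("No RAPL zones visible on this node")
    for zone, name in found.items():
        print(f"{zone}: {name}")
    print("DRAM domain available:", has_dram_zone(found))
```

An empty result on a cloud instance would suggest that RAPL is not exposed to the guest, which is the situation in which the questions above about estimators and DRAM apply.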

All in all, can Kepler be used in the following list of processors used in the public cloud?

User Story

As an OpenShift customer, I want to understand whether Kepler metrics coming from workloads running in the public cloud can be trusted and reproduced, so that I can make business decisions.
For example, if I measure my workload on AWS twice, can I expect the same results when I do not control the type of instance (and processor) being deployed? Can I make business decisions based on these metrics?
If I run on GCP, will I get the same numbers?

Requirement

If the answers to the previous questions depend on hardware, the following shall be documented:

Which APIs is Power Monitoring actually able to use? (RAPL, others?)
In which public clouds is power monitoring supported?
What do I need to know about my hardware to trust metrics reported by power monitoring? Is it important to know any of the following?
1. Hardware (CPU, GPU...)
2. Kernel
3. Cloud providers (AWS, GCP, Azure)
4. On-prem supported versions
5. Virtualization of the cluster
Does Kepler collect data on master and worker nodes?

For all the above cases, if there are differences in how Kepler should be installed and used, such differences shall be documented as well.

Compatibility matrix example from scaphandre: https://hubblo-org.github.io/scaphandre-documentation/compatibility.html

Of course, Kepler also has estimators, which should be added to the "matrix".

Acceptance Criteria

The documentation needs to make clear, before power monitoring is installed, how valid the numbers it reports will be.

clones

OBSDA-627 Documentation of Supported Hardware and Platforms for Power monitoring

To Do

Assignee: Roger Florén

Reporter: Roger Florén

Votes: 0

Watchers: 1

Created: 2024/11/19 7:47 AM

Updated: 2024/11/19 8:00 AM
