Observability and Data Analysis Program / OBSDA-1032

Documentation of Supported Hardware and Platforms for Power monitoring 1.1

    • OBSDA-1033 Power monitoring POST GA Tracker

      Background

      As mentioned in a blog post, Kepler claims to collect real-time power consumption metrics from node components using various APIs: Intel Running Average Power Limit (RAPL) for CPU and DRAM power; NVIDIA Management Library (NVML) for GPU power; Advanced Configuration and Power Interface (ACPI) for platform power, i.e., the power of the entire node; and Redfish or Intelligent Platform Management Interface (IPMI), also for platform power. When no real-time power metrics are available in the system, it uses regression-based trained power models instead.

      However, in the Work in progress Challenges section, the authors also mention that "Extra Data Import Support: One of the key focuses of the Kepler community is to broaden its horizons by providing extra power data import support, e.g., power source from Board Management Controller (BMC), IPMI support, and RedFish support".
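
      The power sources named above can be probed on a given Linux node before any decision is made about trusting the metrics. The following is a minimal sketch, not Kepler's own detection logic: the sysfs and device paths are the usual Linux locations and are assumptions for any particular node, the hwmon name match for the ACPI power meter is an assumption, and the helper names (`rapl_domains`, `probe`) are hypothetical. Finding a source only shows that the kernel exposes it, not that Kepler uses it.

```python
import shutil
from pathlib import Path

# Usual Linux locations; treat them as assumptions for any given node.
RAPL_ROOT = Path("/sys/class/powercap")
HWMON_ROOT = Path("/sys/class/hwmon")
IPMI_DEVICES = ("/dev/ipmi0", "/dev/ipmi/0", "/dev/ipmidev/0")


def read_text(path):
    """Return stripped file contents, or None if the file cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def rapl_domains(root=RAPL_ROOT):
    """List RAPL domain names (e.g. package-0, dram) exposed via powercap."""
    names = []
    if not root.is_dir():
        return names
    for zone in sorted(root.glob("intel-rapl:*")):
        name = read_text(zone / "name")
        if name:
            names.append(name)
    return names


def has_acpi_power_meter(root=HWMON_ROOT):
    """Check for an hwmon device whose name suggests an ACPI power meter.

    Matching on "power_meter" is an assumption about the driver's name.
    """
    if not root.is_dir():
        return False
    for dev in root.glob("hwmon*"):
        name = read_text(dev / "name")
        if name and "power_meter" in name:
            return True
    return False


def probe():
    """Summarize which power sources appear to be exposed on this node."""
    domains = rapl_domains()
    return {
        "rapl_domains": domains,
        "rapl_dram": any(d.startswith("dram") for d in domains),
        "acpi_power_meter": has_acpi_power_meter(),
        "ipmi_device": any(Path(p).exists() for p in IPMI_DEVICES),
        # nvidia-smi on PATH is only a hint that NVML may be usable.
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for key, value in probe().items():
        print(f"{key}: {value}")
```

      On virtual machines in the public cloud, such a probe may well report no RAPL domains at all, which is the case the questions below are concerned with.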

      Problem statement

      It is not clear which power meters are being used. This, together with knowing which counters are available on each architecture, becomes relevant when trying to understand on which CPU architectures Kepler can provide sound results and can therefore be used.

      Work on upstream Kepler has been led by IBM and, to some extent, Intel. Does this work apply to all of their CPU architectures and brands? Or does it only apply to a limited set of Intel processors?

      If RAPL is not available, can DRAM power still be measured?

      All in all, can Kepler be used on the processors offered by the following public cloud platforms?

      1. GCP
      2. AWS
      3. Azure BM
      4. Azure VMs

      User Story

      1. As an OpenShift customer, I want to understand whether Kepler metrics coming from workloads running in the public cloud can be trusted and reproduced, so that I can make business decisions.
      2. For example, if I measure my workload on AWS twice, can I expect the same results when not controlling the type of instance (and processor) being deployed? Can I make business decisions based on these metrics?
      3. If I run on GCP, will I get the same numbers?

      Requirement

      If the previous answers depend on hardware, the following shall be documented:

      1. Which APIs is Power Monitoring actually able to use? (RAPL, others?)
      2. In which public clouds is power monitoring supported?
      3. What do I need to know about my hardware to trust metrics reported by power monitoring? Is it important to know any of the following?
        1. Hardware (CPU, GPU...)
        2. Kernel
        3. Cloud providers (AWS, GCP, Azure)
        4. On-prem supported versions
        5. Virtualization of the cluster
      4. Does Kepler collect data on master and worker nodes?

      For all the above cases, if there are differences in how Kepler should be installed and used, such differences shall be documented as well.

      Compatibility matrix example from Scaphandre: https://hubblo-org.github.io/scaphandre-documentation/compatibility.html

      Of course, Kepler also has estimators, which should be added to the "matrix".

      Acceptance Criteria

      • The documentation must make clear, before power monitoring is installed, how valid the reported numbers will be on the target hardware and platform.

              rh-ee-rfloren Roger Florén