Red Hat Advanced Cluster Management / ACM-22688

Support System-Level Federated Learning Metrics in Open Cluster Management

    • GH Train-31

      Value Statement

      System-Level Metrics (Node & Pod Resource Usage)

      In the first phase, we focus on monitoring resource consumption for FL server and client components.

      Metric Type        Source                  Collection Tool
      CPU Usage          Node / Pod              cAdvisor + OpenTelemetry
      Memory Usage       Node / Pod              cAdvisor + OpenTelemetry
      GPU Utilization    Node / Pod (NVIDIA)     Kepler (via DCGM)
      Power Consumption  Node / Container Level  Kepler (RAPL or cgroup-based) - computation and communication
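
CPU time from cAdvisor and energy from Kepler are typically exposed as cumulative counters (for example container_cpu_usage_seconds_total and kepler_container_joules_total), so a usage or power figure comes from the difference between two scrapes. The following is a minimal sketch of that arithmetic, not part of the planned implementation; the Sample type, the counter_rate helper, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One reading of a cumulative counter (hypothetical helper type)."""
    timestamp_s: float  # scrape time in seconds
    value: float        # counter value: CPU-seconds or joules


def counter_rate(earlier: Sample, later: Sample) -> float:
    """Average per-second increase of a counter between two scrapes."""
    elapsed = later.timestamp_s - earlier.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = later.value - earlier.value
    if delta < 0:
        # Counter reset (for example a container restart): count only
        # what accumulated since the reset.
        delta = later.value
    return delta / elapsed


# Illustrative numbers only, not measurements.
cpu_cores = counter_rate(Sample(0.0, 120.0), Sample(30.0, 135.0))
watts = counter_rate(Sample(0.0, 5000.0), Sample(30.0, 5600.0))
print(cpu_cores)  # 0.5 (CPU-seconds per second, i.e. cores in use)
print(watts)      # 20.0 (joules per second, i.e. watts)
```

In a Prometheus-based setup the same calculation is usually left to PromQL's rate() function rather than done by hand.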

      Collection Setup

      • cAdvisor (via Kubelet)
          • Built into the kubelet; exposes per-pod CPU and memory usage at /metrics/cadvisor on port 10250.
          • Scraped by Prometheus or collected via the OpenTelemetry Collector.
      • Kepler
          • Deployed as a DaemonSet in each cluster.
          • Collects power, CPU cycles, and optionally GPU data (if NVIDIA + DCGM is enabled).
      • OpenTelemetry Collector
          • Gathers metrics from both cAdvisor and Kepler.
          • Forwards them to backends such as Prometheus, any OTLP endpoint, or Grafana Cloud.
          • Adds useful tags: component=fl-server, component=fl-client, cluster_id, pod_name, etc.
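
To illustrate the tagging step, the sketch below parses a few lines in the Prometheus text exposition format and attaches the component, cluster_id, and pod_name labels. It is a Python stand-in for what the collector would do through its own configuration, not the collector's API. It assumes that FL pods are named with an fl-server or fl-client prefix, which the issue does not state; parse_line, tag, and the sample data are hypothetical.

```python
import re

# `name{labels} value`; the label block is optional.
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_line(line):
    """Return (name, labels, value) for a sample line, or None otherwise."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or HELP/TYPE comment
    match = _LINE.match(line)
    if match is None:
        return None
    name, raw_labels, value = match.groups()
    labels = dict(_LABEL.findall(raw_labels or ""))
    return name, labels, float(value)


def tag(labels, cluster_id):
    """Copy labels and add cluster_id, pod_name, and (if known) component."""
    pod = labels.get("pod", "")
    tagged = dict(labels, cluster_id=cluster_id, pod_name=pod)
    # Assumption: the component can be inferred from the pod name prefix.
    for component in ("fl-server", "fl-client"):
        if pod.startswith(component):
            tagged["component"] = component
    return tagged


# Made-up sample data in the shape cAdvisor exposes.
EXPOSITION = """\
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{namespace="fl",pod="fl-server-0"} 12.5
container_memory_working_set_bytes{namespace="fl",pod="fl-client-7"} 1048576
"""

samples = []
for raw in EXPOSITION.splitlines():
    parsed = parse_line(raw)
    if parsed is not None:
        name, labels, value = parsed
        samples.append((name, tag(labels, "cluster1"), value))

for name, labels, value in samples:
    print(name, labels["component"], labels["cluster_id"], value)
```

Deriving component from the pod name keeps the sketch short; a pod label set by the FL deployment would be a sturdier source if one is available.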

      Definition of Done for Engineering Story Owner (Checklist)

      • Enable the OpenTelemetry Collector add-on in Open Cluster Management.
      • Install a Prometheus server on the hub cluster.
      • The metrics listed above can be retrieved from the hub Prometheus endpoint.
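
One way to check the last item is an instant query against the Prometheus HTTP API (/api/v1/query). The sketch below builds the request URL and parses a response body of the documented shape. The hub address, the PromQL expression, and the canned response are assumptions for illustration; no network call is made.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body):
    """Turn an instant-query JSON body into (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each result carries its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


# Hypothetical hub endpoint and query; adjust to the real deployment.
HUB_PROMETHEUS = "http://hub-prometheus.example:9090"
QUERY = 'sum by (cluster_id) (rate(container_cpu_usage_seconds_total{component="fl-server"}[5m]))'

url = instant_query_url(HUB_PROMETHEUS, QUERY)

# Canned response standing in for what the hub would return.
SAMPLE_BODY = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"cluster_id": "cluster1"}, "value": [1700000000, "0.42"]},
        ],
    },
})

for labels, value in parse_vector(SAMPLE_BODY):
    print(labels["cluster_id"], value)
```

Against a live hub, the URL could be fetched with urllib.request.urlopen(url) and the response body passed to parse_vector.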

      Development Complete

      • The code is complete.
      • Functionality is working.
      • Any required downstream Dockerfile changes are made.

      Tests Automated

      • [ ] Unit/function tests have been automated and incorporated into the
        build.
      • [ ] 100% automated unit/function test coverage for new or changed APIs.

      Secure Design

      • [ ] Security has been assessed and incorporated into your threat model.

      Multidisciplinary Teams Readiness

      Support Readiness

      • [ ] The must-gather script has been updated.

              rh-ee-myan Meng Yan
              yuhe@redhat.com Yuanyuan He
              Hui Chen Hui Chen