Story
Resolution: Done
Major
Product / Portfolio Work
3
GH Train-31
Value Statement
System-Level Metrics (Node & Pod Resource Usage)
In the first phase, we focus on monitoring resource consumption of the federated learning (FL) server and client components.
Metric Type | Source | Collection Tool |
CPU Usage | Node / Pod | cAdvisor + OpenTelemetry |
Memory Usage | Node / Pod | cAdvisor + OpenTelemetry |
GPU Utilization | Node / Pod (NVIDIA) | Kepler (via DCGM) |
Power Consumption | Node / Container Level | Kepler (RAPL or cgroup-based) - computation and communication |
Collection Setup
- cAdvisor (via Kubelet)
  - Built into Kubernetes; exposes per-pod CPU and memory usage at /metrics/cadvisor on Kubelet port 10250.
  - Scraped by Prometheus or collected via the OpenTelemetry Collector.
- Kepler
  - Deployed as a DaemonSet in each cluster.
  - Collects power, CPU cycles, and optionally GPU data (if NVIDIA + DCGM is enabled).
- OpenTelemetry Collector
  - Gathers metrics from both cAdvisor and Kepler.
  - Forwards them to backends such as Prometheus, Grafana Cloud, or any OTLP-compatible endpoint.
  - Adds useful tags: component=fl-server, component=fl-client, cluster_id, pod_name, etc.
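The tagging step can be illustrated with a minimal Python sketch. It is not the Collector's actual configuration; it only shows the intended label enrichment. It assumes that FL pods are named with an `fl-server` or `fl-client` prefix (a hypothetical convention) and that each sample carries the `pod` label used by cAdvisor metrics. The function name `enrich_labels` is made up for illustration.

```python
from typing import Dict

# Assumption: FL pods are named "fl-server-..." or "fl-client-...".
# This naming convention is hypothetical and not stated in the story.
COMPONENT_PREFIXES = ("fl-server", "fl-client")


def enrich_labels(labels: Dict[str, str], cluster_id: str) -> Dict[str, str]:
    """Return a copy of `labels` with cluster_id, pod_name, and (if matched) component."""
    pod = labels.get("pod", "")
    enriched = dict(labels)  # do not mutate the caller's dictionary
    enriched["cluster_id"] = cluster_id
    enriched["pod_name"] = pod
    for prefix in COMPONENT_PREFIXES:
        if pod.startswith(prefix):
            enriched["component"] = prefix
            break
    return enriched
```

Pods that do not match either prefix keep their labels and gain only `cluster_id` and `pod_name`, so non-FL workloads are not mislabeled.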
Definition of Done for Engineering Story Owner (Checklist)
- The OpenTelemetry Collector addon is enabled in open-cluster-management.
- A Prometheus server is installed in the hub cluster.
- The metrics listed above can be retrieved from the hub Prometheus endpoint.
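One way to check the last item is to query the hub Prometheus over its HTTP API (`/api/v1/query`). The sketch below only builds the request URL and parses an instant-vector response; it makes no network call. The base URL, the helper names, and the example PromQL are assumptions: `container_cpu_usage_seconds_total` is a cAdvisor metric, and the `component` label is assumed to have been added by the Collector.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode

# Example PromQL (assumption): per-pod CPU rate for the FL server over 5 minutes.
CPU_QUERY = (
    'sum by (pod) (rate(container_cpu_usage_seconds_total'
    '{component="fl-server"}[5m]))'
)


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> List[Tuple[Dict[str, str], float]]:
    """Parse a Prometheus instant-vector response into (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result item has "metric" (labels) and "value" ([timestamp, "string value"]).
    return [
        (item["metric"], float(item["value"][1]))
        for item in payload["data"]["result"]
    ]
```

The URL returned by `build_query_url` can be fetched with any HTTP client; a non-empty result for both FL components would indicate that the metrics are reaching the hub.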
Development Complete
- The code is complete.
- Functionality is working.
- Any required downstream Docker file changes are made.
Tests Automated
[ ] Unit/function tests have been automated and incorporated into the build.
[ ] 100% automated unit/function test coverage for new or changed APIs.
Secure Design
[ ] Security has been assessed and incorporated into your threat model.
Multidisciplinary Teams Readiness
[ ] Create an informative documentation issue using the [Customer Portal_doc_issue template](https://github.com/stolostron/backlog/issues/new?assignees=&labels=squad%3Adoc&template=doc_issue.md&title=), and ensure doc acceptance criteria are met. Link the development issue to the doc issue.
[ ] Provide input to the QE team, and ensure QE acceptance criteria (established between story owner and QE focal) are met.
Support Readiness
[ ] The must-gather script has been updated.
- clones ACM-19546 Enable Observability in OCM with Federated Learning PoC (Closed)