Type: Bug
Resolution: Done
Priority: Major
Affects Version: ACM 2.12.0
Category: Quality / Stability / Reliability
Severity: Important
Description of problem:
When using multiple workers, we might lose metrics. This appears to be caused by a race condition that leads the different workers to federate the same metrics from Prometheus.
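For context on why overlapping federation drops data, here is a minimal sketch in Go, assuming the common hash-based sharding pattern in which each worker is meant to own a disjoint slice of the series; the names (Series, shardKey, assignShard) are hypothetical and this is not the actual metric-collector implementation:

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// Series is a hypothetical stand-in for one federated timeseries.
type Series struct {
	Labels map[string]string
}

// shardKey builds a stable string from the label set so the same series
// always hashes to the same value, regardless of map iteration order.
func shardKey(s Series) string {
	keys := make([]string, 0, len(s.Labels))
	for k := range s.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k + "=" + s.Labels[k] + ";")
	}
	return b.String()
}

// assignShard maps a series to exactly one of `workers` shards. If every
// worker consistently filtered to its own shard, no two workers would
// forward the same series; identical timeseriesnumber values across shards
// (as in the logs under "Actual results") suggest the workers ended up
// federating overlapping sets instead.
func assignShard(s Series, workers int) int {
	h := fnv.New32a()
	h.Write([]byte(shardKey(s)))
	return int(h.Sum32()) % workers
}

func main() {
	series := []Series{
		{Labels: map[string]string{"__name__": "up", "namespace": "openshift-monitoring"}},
		{Labels: map[string]string{"__name__": "up", "namespace": "open-cluster-management"}},
	}
	for _, s := range series {
		fmt.Printf("%v -> shard %d\n", s.Labels, assignShard(s, 4))
	}
}

Under a scheme like this every series maps to exactly one shard, so if a startup race leaves several workers with the same shard view, they federate overlapping sets and the series owned by the remaining shards are never forwarded.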
Version-Release number of selected component (if applicable):
ACM 2.12 - ACM 2.15
How reproducible:
- Most of the time
Steps to Reproduce:
- Set up an ACM hub + spoke
- Scale up to more than one worker via the observabilityAddonSpec in the MCO CR
- You may need to restart metric-collector a few times to hit the bad state
- You can run count({cluster="your-cluster-name"}) to confirm the issue (see the sketch after these steps); in a healthy state the count is very stable and does not change when the metric collector is restarted
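The check in the last step can be scripted against any Prometheus-compatible query API; a minimal sketch, where the endpoint, authentication, and cluster name are placeholders to adapt to your environment:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	// Placeholders: point this at a Prometheus-compatible query endpoint you can
	// reach (e.g. via a port-forward) and substitute the real cluster name.
	endpoint := "http://localhost:9090/api/v1/query"
	query := `count({cluster="your-cluster-name"})`

	resp, err := http.Get(endpoint + "?query=" + url.QueryEscape(query))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Instant-query responses look like:
	// {"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[<ts>,"<count>"]}]}}
	var body struct {
		Status string `json:"status"`
		Data   struct {
			Result []struct {
				Value [2]interface{} `json:"value"`
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		panic(err)
	}
	for _, r := range body.Data.Result {
		fmt.Println("count:", r.Value[1])
	}
}

Running this before and after restarting metric-collector should give the same count when things are healthy; a drop after a restart points at lost series.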
Actual results:
- Metrics might be missing
- The easiest way to confirm this issue is when the multiple workers are sending the exact same number of timeseries, as below (it also seems to always happen on startup):
level=debug caller=logger.go:45 ts=2025-10-03T13:45:13.278299233Z shard=3 component=forwarder component=metricsclient timeseriesnumber=13730
level=debug caller=logger.go:45 ts=2025-10-03T13:45:13.318888053Z shard=0 component=forwarder component=metricsclient timeseriesnumber=13730
level=debug caller=logger.go:45 ts=2025-10-03T13:45:13.320064215Z shard=1 component=forwarder component=metricsclient timeseriesnumber=13730
level=debug caller=logger.go:45 ts=2025-10-03T13:45:13.367873111Z shard=2 component=forwarder component=metricsclient timeseriesnumber=15300
Expected results:
- No metrics are lost; each worker produces a different number of timeseries
Additional info:
- Clones ACM-24920 (Closed): Using multiple metric-collector workers may cause loss of metrics due to race condition