Type: Bug
Resolution: Done
Priority: Major
Affects Version: ACM 2.12.0
Category: Quality / Stability / Reliability
Severity: Important
Description of problem:
When using multiple workers, we might lose metrics. This appears to be caused by a race condition that leads the different workers to federate the same metrics from Prometheus.
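For context on why overlapping federation drops data, here is a minimal sketch in Go, assuming the common hash-based sharding pattern in which each worker is meant to own a disjoint slice of the series; the names (Series, shardKey, assignShard) are hypothetical and this is not the actual metric-collector implementation:

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// Series is a hypothetical stand-in for one federated timeseries.
type Series struct {
	Labels map[string]string
}

// shardKey builds a stable string from the label set so the same series
// always hashes to the same value, regardless of map iteration order.
func shardKey(s Series) string {
	keys := make([]string, 0, len(s.Labels))
	for k := range s.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		b.WriteString(k + "=" + s.Labels[k] + ";")
	}
	return b.String()
}

// assignShard maps a series to exactly one of `workers` shards. If every
// worker consistently filtered to its own shard, no two workers would
// forward the same series; identical timeseriesnumber values across shards
// (as in the logs under "Actual results") suggest the workers ended up
// federating overlapping sets instead.
func assignShard(s Series, workers int) int {
	h := fnv.New32a()
	h.Write([]byte(shardKey(s)))
	return int(h.Sum32()) % workers
}

func main() {
	series := []Series{
		{Labels: map[string]string{"__name__": "up", "namespace": "openshift-monitoring"}},
		{Labels: map[string]string{"__name__": "up", "namespace": "open-cluster-management"}},
	}
	for _, s := range series {
		fmt.Printf("%v -> shard %d\n", s.Labels, assignShard(s, 4))
	}
}

Under a scheme like this every series maps to exactly one shard, so if a startup race leaves several workers with the same shard view, they federate overlapping sets and the series owned by the remaining shards are never forwarded.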
Version-Release number of selected component (if applicable):
ACM 2.12 - ACM 2.15
How reproducible:
- Most of the time
Steps to Reproduce:
- Set up an ACM hub + spoke
- Scale up to more than one worker via the observabilityAddonSpec in the MCO CR
- You may need to restart metric-collector a few times to hit the bad state
- You can run count({cluster="your-cluster-name"}) to confirm the issue (see the sketch after these steps); in a healthy state the count is very stable and does not change when the metric collector is restarted
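The check in the last step can be scripted against any Prometheus-compatible query API; a minimal sketch, where the endpoint, authentication, and cluster name are placeholders to adapt to your environment:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	// Placeholders: point this at a Prometheus-compatible query endpoint you can
	// reach (e.g. via a port-forward) and substitute the real cluster name.
	endpoint := "http://localhost:9090/api/v1/query"
	query := `count({cluster="your-cluster-name"})`

	resp, err := http.Get(endpoint + "?query=" + url.QueryEscape(query))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Instant-query responses look like:
	// {"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[<ts>,"<count>"]}]}}
	var body struct {
		Status string `json:"status"`
		Data   struct {
			Result []struct {
				Value [2]interface{} `json:"value"`
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		panic(err)
	}
	for _, r := range body.Data.Result {
		fmt.Println("count:", r.Value[1])
	}
}

Running this before and after restarting metric-collector should give the same count when things are healthy; a drop after a restart points at lost series.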
Actual results:
- Metrics might be missing
- The easiest way to confirm this issue is when the multiple workers are sending the exact same number of timeseries, as below (it also seems to always happen on startup):
level=debug caller=logger.go:45 ts=2025-10-03T13:45:13.278299233Z shard=3 component=forwarder component=metricsclient timeseriesnumber=13730
level=debug caller=logger.go:45 ts=2025-10-03T13:45:13.318888053Z shard=0 component=forwarder component=metricsclient timeseriesnumber=13730
level=debug caller=logger.go:45 ts=2025-10-03T13:45:13.320064215Z shard=1 component=forwarder component=metricsclient timeseriesnumber=13730
level=debug caller=logger.go:45 ts=2025-10-03T13:45:13.367873111Z shard=2 component=forwarder component=metricsclient timeseriesnumber=15300
Expected results:
- No metrics are lost; each worker produces a different number of timeseries
Additional info:
- Clones ACM-24920 (Closed): Using multiple metric-collector workers may cause loss of metrics due to race condition