  1. Red Hat Advanced Cluster Management
  2. ACM-24923

[2.12] Using multiple metric-collector workers may cause loss of metrics due to race condition


    • Quality / Stability / Reliability
    • Important

      Description of problem:

      When using multiple workers, we might lose metrics. This seems to be caused by a race condition that makes different workers federate the same metrics from Prometheus.

      Version-Release number of selected component (if applicable):

      ACM 2.12 through ACM 2.15

      How reproducible:

      • Most of the time

      Steps to Reproduce:

      1. Set up an ACM hub and spoke
      2. Scale up to more than 1 worker in the observabilityAddonSpec in the MCO CR
      3. You may need to restart metric-collector a few times to hit the bad state
      4. Query count({cluster="your-cluster-name"}) to confirm the issue. The result should be very stable and should not change when the metric collector is restarted
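The check in step 4 could be scripted roughly as follows. This is a minimal sketch, not part of the reported reproduction: it assumes a Prometheus-compatible query endpoint is reachable on the hub and uses the standard `/api/v1/query` instant-query API. The helper names (`count_query`, `query_url`, `parse_count`, `is_stable`), the `base_url` argument, and the `tolerance` parameter are hypothetical and only for illustration.

```python
import json
from urllib.parse import urlencode


def count_query(cluster):
    """Build the PromQL expression from step 4 for one managed cluster."""
    return 'count({cluster="%s"})' % cluster


def query_url(base_url, cluster):
    """Build an instant-query URL (base_url is an assumed query endpoint)."""
    params = urlencode({"query": count_query(cluster)})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_count(response_body):
    """Extract the series count from an instant-query JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query failed: %r" % payload.get("error"))
    result = payload["data"]["result"]
    if not result:
        # count() over an empty selection returns no sample at all.
        return 0
    # Each vector sample is [timestamp, "value-as-string"].
    return int(float(result[0]["value"][1]))


def is_stable(counts, tolerance=0):
    """True if the sampled counts differ by at most `tolerance` (hypothetical knob)."""
    return not counts or max(counts) - min(counts) <= tolerance
```

Sampling the count before and after each metric-collector restart and passing the values to `is_stable` gives a quick yes/no signal. Some churn in the series count may be normal on a live cluster, so a small non-zero `tolerance` may be needed in practice.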

      Actual results:

      • Metrics might be missing
      • The easiest way to confirm this issue is when multiple workers send the exact same number of timeseries, like below (it also seems to always happen on startup):
      level=debug caller=logger.go:45 ts=2025-10-03T13:45:13.278299233Z shard=3 component=forwarder component=metricsclient timeseriesnumber=13730
      level=debug caller=logger.go:45 ts=2025-10-03T13:45:13.318888053Z shard=0 component=forwarder component=metricsclient timeseriesnumber=13730
      level=debug caller=logger.go:45 ts=2025-10-03T13:45:13.320064215Z shard=1 component=forwarder component=metricsclient timeseriesnumber=13730
      level=debug caller=logger.go:45 ts=2025-10-03T13:45:13.367873111Z shard=2 component=forwarder component=metricsclient timeseriesnumber=15300
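The symptom in the log lines above (several shards reporting an identical `timeseriesnumber`) can be detected automatically. The following is a minimal sketch that assumes the debug log format shown above; the function name `duplicate_shard_counts` is hypothetical. It is a heuristic only: two shards could legitimately report the same count by coincidence.

```python
import re
from collections import defaultdict

# Captures the shard id and the time series count from forwarder debug lines.
LINE_RE = re.compile(r"\bshard=(\d+)\b.*\btimeseriesnumber=(\d+)\b")


def duplicate_shard_counts(log_text):
    """Return {count: [shards]} for every count reported by more than one shard."""
    shards_by_count = defaultdict(set)
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            shard, count = int(match.group(1)), int(match.group(2))
            shards_by_count[count].add(shard)
    return {
        count: sorted(shards)
        for count, shards in shards_by_count.items()
        if len(shards) > 1
    }


# The four log lines from this report.
SAMPLE = """\
level=debug caller=logger.go:45 ts=2025-10-03T13:45:13.278299233Z shard=3 component=forwarder component=metricsclient timeseriesnumber=13730
level=debug caller=logger.go:45 ts=2025-10-03T13:45:13.318888053Z shard=0 component=forwarder component=metricsclient timeseriesnumber=13730
level=debug caller=logger.go:45 ts=2025-10-03T13:45:13.320064215Z shard=1 component=forwarder component=metricsclient timeseriesnumber=13730
level=debug caller=logger.go:45 ts=2025-10-03T13:45:13.367873111Z shard=2 component=forwarder component=metricsclient timeseriesnumber=15300
"""

if __name__ == "__main__":
    print(duplicate_shard_counts(SAMPLE))  # {13730: [0, 1, 3]}
```

For the log excerpt in this report, shards 0, 1 and 3 are flagged as sharing the count 13730, while shard 2 is not.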
      

      Expected results:

      • No metrics are lost. The workers each send a different number of timeseries.

      Additional info:

              rh-ee-jachanse Jacob Baungard Hansen