  1. Red Hat Advanced Cluster Management
  2. ACM-30259

After MCOA is enabled, observability-thanos-receive pods log many "context deadline exceeded" errors


    • Important
    • None

      Description of problem:

      Trying to test ACM-22533 in the large perf/scale environment with 2500+ managed SNOs.

      After enabling MCOA, there are many "context deadline exceeded" errors in the logs of the observability-thanos-receive pods. Examples are shown below:

      ts=2026-02-18T21:55:09.271237405Z caller=handler.go:637 level=error component=receive component=receive-handler tenant=f5e0558d-16cb-499e-8b31-50be09a157b9 err="context deadline exceeded" msg="internal server error"
      ts=2026-02-18T21:55:09.234831218Z caller=handler.go:637 level=error component=receive component=receive-handler tenant=f5e0558d-16cb-499e-8b31-50be09a157b9 err="context deadline exceeded" msg="internal server error"
      ts=2026-02-18T21:55:09.345420213Z caller=handler.go:637 level=error component=receive component=receive-handler tenant=f5e0558d-16cb-499e-8b31-50be09a157b9 err="context deadline exceeded" msg="internal server error"
      ts=2026-02-18T21:55:09.777473285Z caller=handler.go:637 level=error component=receive component=receive-handler tenant=f5e0558d-16cb-499e-8b31-50be09a157b9 err="context deadline exceeded" msg="internal server error"
      ts=2026-02-18T21:55:09.781436352Z caller=handler.go:637 level=error component=receive component=receive-handler tenant=f5e0558d-16cb-499e-8b31-50be09a157b9 err="context deadline exceeded" msg="internal server error"
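
      To quantify how often these errors occur in the attached logs, a small script along the following lines could count them per tenant and per minute. This is only a sketch: it assumes the logs keep the logfmt layout shown above, and the names parse_logfmt and count_deadline_errors are hypothetical helpers, not part of Thanos or ACM.

```python
import re
from collections import Counter

# Matches logfmt pairs: key=value or key="quoted value"
PAIR_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line):
    """Parse one logfmt line into a dict, stripping surrounding quotes."""
    fields = {}
    for key, value in PAIR_RE.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        # Repeated keys (e.g. component=) keep the last value seen.
        fields[key] = value
    return fields


def count_deadline_errors(lines):
    """Count 'context deadline exceeded' errors per (tenant, minute)."""
    counts = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("level") != "error":
            continue
        if "context deadline exceeded" not in fields.get("err", ""):
            continue
        # Truncate the timestamp to minute resolution, e.g. 2026-02-18T21:55
        minute = fields.get("ts", "")[:16]
        counts[(fields.get("tenant", "unknown"), minute)] += 1
    return counts
```

      Running count_deadline_errors over the lines of the receive pod logs from before and after MCOA was enabled would give directly comparable per-minute totals.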

      These errors were not present before MCOA was enabled. I checked the agents on the managed SNOs, and the prom-agent-platform-metrics-collector-0 pod log has corresponding errors. An example is shown below:

      time=2026-02-18T22:00:09.601Z level=WARN source=queue_manager.go:2027 msg="Failed to send batch, retrying" component=remote remote_name=acm-observability url=https://observatorium-api-open-cluster-management-observability.apps.acm-gigantic.rdu2.scalelab.redhat.com/api/metrics/v1/default/api/v1/receive err="server returned HTTP status 500 Internal Server Error: context deadline exceeded\n"
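
      On the agent side, the retry warnings can be tallied by remote name and HTTP status to see how many batches are affected. The following is a rough sketch that assumes the warning format shown above; count_retries_by_status is a hypothetical helper name, and the regular expressions rely only on the message text visible in this log line.

```python
import re
from collections import Counter

# Text fragments taken from the agent warning shown in this report.
RETRY_MSG = 'msg="Failed to send batch, retrying"'
REMOTE_RE = re.compile(r'\bremote_name=(\S+)')
STATUS_RE = re.compile(r'server returned HTTP status (\d{3})')


def count_retries_by_status(lines):
    """Count retried remote-write batches per (remote_name, HTTP status)."""
    counts = Counter()
    for line in lines:
        if RETRY_MSG not in line:
            continue
        remote = REMOTE_RE.search(line)
        status = STATUS_RE.search(line)
        key = (
            remote.group(1) if remote else "unknown",
            # None means the retry was not caused by an HTTP status response.
            int(status.group(1)) if status else None,
        )
        counts[key] += 1
    return counts
```

      Feeding it the lines of vm00001_prom-agent-platform-metrics-collector-0.log would show whether all retries are HTTP 500 responses or whether other failure modes are mixed in.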

       

      I checked the Grafana dashboard. As shown in the screenshot below for one of the SNOs, the CPU usage data appears again after the new agents are installed, following a small gap. So it looks like pushing metrics to the server is not totally broken; only some of the requests time out. Attaching the Thanos receive pod logs from before MCOA was enabled (run11_b4_MCOA_observability-thanos-receive-default-2-thanos-receive.log) and the logs from after MCOA was enabled. Also attaching the agent log from before MCOA was enabled (vm00001_metrics-collector-deployment.log) and the new agent log from after MCOA was enabled (vm00001_prom-agent-platform-metrics-collector-0.log).

      Version-Release number of selected component (if applicable):

      How reproducible:

      Steps to Reproduce:

      1.  
      2.  
      3. ...

      Actual results:

      Expected results:

      Additional info:

              Unassigned Unassigned
              rhn-support-txue Ting Xue
              ACM QE Team

                Created:
                Updated: