Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: ACM 2.16.0
Affects Version/s: ACM 2.16.0
Component/s: Observability
Labels:
- triaged

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Ready:
False
Epic Link:
MCOA Metrics Collection post-ga
Intelligence Requested:
Market:

Severity:
Important

Regression:
None

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

PX Impact Score:

Description of problem:

Trying to test ACM-22533 in the large perf/scale environment with 2500+ managed SNOs.

After enabling MCOA, the logs of the observability-thanos-receive pods show many "context deadline exceeded" errors. Examples:

ts=2026-02-18T21:55:09.271237405Z caller=handler.go:637 level=error component=receive component=receive-handler tenant=f5e0558d-16cb-499e-8b31-50be09a157b9 err="context deadline exceeded" msg="internal server error"
ts=2026-02-18T21:55:09.234831218Z caller=handler.go:637 level=error component=receive component=receive-handler tenant=f5e0558d-16cb-499e-8b31-50be09a157b9 err="context deadline exceeded" msg="internal server error"
ts=2026-02-18T21:55:09.345420213Z caller=handler.go:637 level=error component=receive component=receive-handler tenant=f5e0558d-16cb-499e-8b31-50be09a157b9 err="context deadline exceeded" msg="internal server error"
ts=2026-02-18T21:55:09.777473285Z caller=handler.go:637 level=error component=receive component=receive-handler tenant=f5e0558d-16cb-499e-8b31-50be09a157b9 err="context deadline exceeded" msg="internal server error"
ts=2026-02-18T21:55:09.781436352Z caller=handler.go:637 level=error component=receive component=receive-handler tenant=f5e0558d-16cb-499e-8b31-50be09a157b9 err="context deadline exceeded" msg="internal server error"
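
To compare the attached receive logs before and after MCOA is enabled, the error lines can be counted per tenant. The following is a minimal sketch, assuming the logs keep the key=value (logfmt-style) layout shown above; the helper names `parse_logfmt` and `count_deadline_errors` are hypothetical and not part of Thanos or ACM.

```python
import shlex
from collections import Counter


def parse_logfmt(line: str) -> dict[str, str]:
    """Split one key=value log line into a dict (the last duplicate key wins)."""
    fields: dict[str, str] = {}
    try:
        tokens = shlex.split(line)  # keeps quoted values such as err="..." together
    except ValueError:
        return fields  # unbalanced quotes: treat the line as unparseable
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def count_deadline_errors(lines) -> Counter:
    """Count error-level 'context deadline exceeded' lines per tenant."""
    counts: Counter = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if (fields.get("level") == "error"
                and fields.get("err") == "context deadline exceeded"):
            counts[fields.get("tenant", "unknown")] += 1
    return counts


# Two lines copied from the log excerpt in this report.
SAMPLE = [
    'ts=2026-02-18T21:55:09.271237405Z caller=handler.go:637 level=error '
    'component=receive component=receive-handler '
    'tenant=f5e0558d-16cb-499e-8b31-50be09a157b9 '
    'err="context deadline exceeded" msg="internal server error"',
    'ts=2026-02-18T21:55:09.234831218Z caller=handler.go:637 level=error '
    'component=receive component=receive-handler '
    'tenant=f5e0558d-16cb-499e-8b31-50be09a157b9 '
    'err="context deadline exceeded" msg="internal server error"',
]

if __name__ == "__main__":
    for tenant, n in count_deadline_errors(SAMPLE).items():
        print(tenant, n)
```

Running the same function over each attached log file (one line per list element) would give a per-tenant count that can be compared between the "b4" and "after" runs.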

These errors were not present before MCOA was enabled. On the managed SNOs, the prom-agent-platform-metrics-collector-0 pod log shows corresponding errors. Example:

time=2026-02-18T22:00:09.601Z level=WARN source=queue_manager.go:2027 msg="Failed to send batch, retrying" component=remote remote_name=acm-observability url=https://observatorium-api-open-cluster-management-observability.apps.acm-gigantic.rdu2.scalelab.redhat.com/api/metrics/v1/default/api/v1/receive err="server returned HTTP status 500 Internal Server Error: context deadline exceeded\n"
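
The agent-side warning carries the HTTP status and the server's error text, which makes it possible to match agent failures against the receive-side errors. The following is a rough sketch, assuming the `err` field keeps the "server returned HTTP status ..." wording seen above; the function name `remote_write_failure` is hypothetical.

```python
import re

# Matches: server returned HTTP status <code> <reason>: <detail>
# The detail stops at a closing quote or at the literal "\n" the agent logs.
STATUS_RE = re.compile(
    r'server returned HTTP status (\d{3}) ([^:"]+): ([^"\\]*)'
)


def remote_write_failure(line: str):
    """Return (status_code, reason, detail) from an agent log line, or None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    code, reason, detail = match.groups()
    return int(code), reason.strip(), detail.strip()


# Tail of the agent log line quoted in this report (raw string keeps the literal \n).
SAMPLE_LINE = (
    r'level=WARN source=queue_manager.go:2027 msg="Failed to send batch, retrying" '
    r'component=remote remote_name=acm-observability '
    r'err="server returned HTTP status 500 Internal Server Error: '
    r'context deadline exceeded\n"'
)

if __name__ == "__main__":
    print(remote_write_failure(SAMPLE_LINE))
```

A 500 status with the detail "context deadline exceeded" matches the error text logged by the receive handler, and the "retrying" message suggests the agent resends the batch rather than dropping it.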

In the Grafana dashboard (see the screenshot for one of the SNOs), the CPU usage data appears again after the new agents are installed, following a small gap. So pushing metrics to the server does not look totally broken; only some of the requests time out. Attaching the thanos receive pod logs from before MCOA was enabled (run11_b4_MCOA_observability-thanos-receive-default-2-thanos-receive.log) and the logs from after MCOA was enabled. Also attaching the agent log from before MCOA was enabled (vm00001_metrics-collector-deployment.log) and the new agent log from after MCOA was enabled (vm00001_prom-agent-platform-metrics-collector-0.log).

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:


Attachments:
- image-2026-02-19-11-51-31-913.png (1.40 MB, 2026/02/19 4:51 PM)
- run11_after_MCOA_observability-thanos-receive-default-0-thanos-receive.log (403 kB, 2026/02/19 4:57 PM)
- run11_after_MCOA_observability-thanos-receive-default-1-thanos-receive.log (400 kB, 2026/02/19 4:57 PM)
- run11_after_MCOA_observability-thanos-receive-default-2-thanos-receive.log (352 kB, 2026/02/19 4:57 PM)
- run11_b4_MCOA_observability-thanos-receive-default-0-thanos-receive.log (15 kB, 2026/02/19 4:55 PM)
- run11_b4_MCOA_observability-thanos-receive-default-1-thanos-receive.log (15 kB, 2026/02/19 4:55 PM)
- run11_b4_MCOA_observability-thanos-receive-default-2-thanos-receive.log (16 kB, 2026/02/19 4:55 PM)
- vm00001_metrics-collector-deployment.log (16 kB, 2026/02/19 4:58 PM)
- vm00001_prom-agent-platform-metrics-collector-0.log (2 kB, 2026/02/19 4:59 PM)

Assignee: Unassigned
Reporter: Ting Xue
Team: ACM QE Team
Votes: 0
Watchers: 3
Created: 2026/02/19 5:00 PM
Updated: 2026/02/23 7:06 PM
