Red Hat OpenStack Services on OpenShift
OSPRH-23826

sg-core fails to handle long messages


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • stf-1.5.7
    • stf-1.5.6
    • sg-core
    • rhos-observability-telemetry
      .The `sg-core` container buffer is now sized dynamically to ensure the metrics from long messages are processed

      Before this update, the buffer size of the `sg-core` container was too small to process long messages generated by large scale deployments, which caused metrics to be lost. This update implements a dynamic buffer size that grows to handle higher loads, and a higher limit that matches the capabilities of the Unix socket implementations used by STF. The long messages and metrics generated by large scale deployments are now processed correctly.
    • Bug Fix
    • Bug Delivery Tracker, Observability Sprint 2026 1, Observability Sprint 2025 EOY
    • 3
    • Important

To Reproduce

Steps to reproduce the behavior:

      1. In an environment with a high number of instances (we currently hit this mark in a deployment with ~100 Nova instances), configure STF to monitor those instances
      2. Observe that the real number of instances differs from the number of instances reported by Prometheus
      3. Turn on debug logs for sg-core
      4. See errors such as:

      2025-12-18 04:07:31 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|3d3f\"}], |..., bigger context ...|ba54a4ed42d3ecd63573aa1a00faacabecf720123d3f\"}], |..., handler: ceilometer-metrics[dummy-metrics0]]
      2025-12-18 04:07:32 [DEBUG] failed handling message [handler: ceilometer-metrics[dummy-metrics0], error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|3d3f\"}], |..., bigger context ...|ba54a4ed42d3ecd63573aa1a00faacabecf720123d3f\"}], |...]

      and

      2025-12-18 05:08:15 [WARN] full read buffer used [plugin: socket]
      2025-12-18 05:08:15 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|4ed42d3ecd|..., bigger context ...|signature\": \"2acacbd9c515743bdde5ba54a4ed42d3ecd|..., handler: ceilometer-metrics[socket0]]
      2025-12-18 05:08:16 [DEBUG] receiving 1 msg/s [plugin: socket]
      2025-12-18 05:08:16 [WARN] full read buffer used [plugin: socket]
      2025-12-18 05:08:16 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|4ed42d3ecd|..., bigger context ...|signature\": \"2acacbd9c515743bdde5ba54a4ed42d3ecd|..., handler: ceilometer-metrics[socket0]]
      2025-12-18 05:08:17 [DEBUG] receiving 1 msg/s [plugin: socket]

      Note the sg-core internal metrics: messages are received, but neither decode_count nor decode_error_count was increased.

      Processed metric:
      {
        "Name": "sg_total_ceilometer_msg_received_count",
        "Time": 0,
        "Type": 1,
        "Interval": 0,
        "Value": 4,
        "LabelKeys": [ "source" ],
        "LabelVals": [ "SG" ]
      }
      Processed metric:
      {
        "Name": "sg_total_ceilometer_metric_decode_count",
        "Time": 0,
        "Type": 1,
        "Interval": 0,
        "Value": 0,
        "LabelKeys": [ "source" ],
        "LabelVals": [ "SG" ]
      }
      Processed metric:
      {
        "Name": "sg_total_ceilometer_metric_decode_error_count",
        "Time": 0,
        "Type": 1,
        "Interval": 0,
        "Value": 0,
        "LabelKeys": [ "source" ],
        "LabelVals": [ "SG" ]
      }

      Expected behavior

      • STF should report the correct number of instances

      Bug impact

      • This impacts all users who use STF to monitor OSP deployments with a high number of instances.

      Known workaround

      • A potential workaround is to lower the polling interval so that polling happens more frequently and each message carrying metrics is shorter. This has not been verified.
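      The unverified workaround above would be applied in Ceilometer's polling.yaml. A sketch of what that could look like (the source name and meter list are illustrative, not taken from this deployment):

      ```yaml
      # Ceilometer polling.yaml sketch: a shorter interval means each
      # polling cycle emits a smaller batch of samples per message.
      # Source name and meters below are illustrative placeholders.
      sources:
          - name: instance_pollsters
            interval: 60        # seconds; poll more often so messages stay shorter
            meters:
              - cpu
              - memory.usage
      ```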

      Additional context

      • Related bug https://bugzilla.redhat.com/show_bug.cgi?id=2016460
      • The issue lies in the buffer size being hardwired to 65535 bytes. That value was chosen because UDP sockets have this limitation, but STF uses Unix sockets, which can handle a larger buffer. We should raise the value to accommodate environments with more instances.
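      The "full read buffer used" warning is the signal that a datagram may have been truncated. A minimal sketch of the growth strategy described above, in Go (sg-core's language); the constants and function name are illustrative, not sg-core's actual code:

      ```go
      package main

      import "fmt"

      // Illustrative constants: start at the old hardwired UDP-derived limit,
      // cap growth at a value Unix sockets can comfortably carry.
      const initialBufSize = 65535
      const maxBufSize = 16 * 1024 * 1024

      // nextBufSize doubles the buffer when a read fills it completely,
      // which indicates the datagram may have been truncated.
      func nextBufSize(cur, bytesRead int) int {
          if bytesRead < cur || cur >= maxBufSize {
              return cur // buffer was large enough, or already at the cap
          }
          next := cur * 2
          if next > maxBufSize {
              next = maxBufSize
          }
          return next
      }

      func main() {
          buf := make([]byte, initialBufSize)
          // Simulate a read that used the whole buffer ("full read buffer used"):
          n := len(buf)
          if grown := nextBufSize(len(buf), n); grown > len(buf) {
              buf = make([]byte, grown)
          }
          fmt.Println(len(buf)) // 131070
      }
      ```

      The key design point is that growth happens only on a completely filled read, so small deployments keep the small buffer while large-scale deployments converge on a size that fits their messages.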

              rhn-engineering-vimartin Victoria Martinez de la Cruz