-
Bug
-
Resolution: Done
-
Major
-
stf-1.5.6
-
None
-
0
-
False
-
-
False
-
?
-
rhos-observability-telemetry
-
None
-
-
Bug Fix
-
-
-
-
Bug Delivery Tracker, Observability Sprint 2026 1, Observability Sprint 2025 EOY
-
3
-
Important
To Reproduce
Steps to reproduce the behavior:
- In an environment with a high number of instances (currently hitting the limit in a deployment with ~100 Nova instances), configure STF to monitor those instances
- Observe that the real number of instances differs from the number of instances reported by Prometheus
- Turn on debug logs for sg-core
- See the error
2025-12-18 04:07:31 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|3d3f\"}], |..., bigger context ...|ba54a4ed42d3ecd63573aa1a00faacabecf720123d3f\"}], |..., handler: ceilometer-metrics[dummy-metrics0]]
2025-12-18 04:07:32 [DEBUG] failed handling message [handler: ceilometer-metrics[dummy-metrics0], error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|3d3f\"}], |..., bigger context ...|ba54a4ed42d3ecd63573aa1a00faacabecf720123d3f\"}], |...]
and
2025-12-18 05:08:15 [WARN] full read buffer used [plugin: socket]
2025-12-18 05:08:15 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|4ed42d3ecd|..., bigger context ...|signature\": \"2acacbd9c515743bdde5ba54a4ed42d3ecd|..., handler: ceilometer-metrics[socket0]]
2025-12-18 05:08:16 [DEBUG] receiving 1 msg/s [plugin: socket]
2025-12-18 05:08:16 [WARN] full read buffer used [plugin: socket]
2025-12-18 05:08:16 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|4ed42d3ecd|..., bigger context ...|signature\": \"2acacbd9c515743bdde5ba54a4ed42d3ecd|..., handler: ceilometer-metrics[socket0]]
2025-12-18 05:08:17 [DEBUG] receiving 1 msg/s [plugin: socket]
Note the sg-core internal metrics below: the received-message counter increases, but neither decode_count nor decode_error is incremented.
Processed metric:{ "Name": "sg_total_ceilometer_msg_received_count", "Time": 0, "Type": 1, "Interval": 0, "Value": 4, "LabelKeys": [ "source" ], "LabelVals": [ "SG" ]}
Processed metric:{ "Name": "sg_total_ceilometer_metric_decode_count", "Time": 0, "Type": 1, "Interval": 0, "Value": 0, "LabelKeys": [ "source" ], "LabelVals": [ "SG" ]}
Processed metric:{ "Name": "sg_total_ceilometer_metric_decode_error_count", "Time": 0, "Type": 1, "Interval": 0, "Value": 0, "LabelKeys": [ "source" ], "LabelVals": [ "SG" ]}
Expected behavior
- STF should report the correct number of instances
Bug impact
- This impacts all users using STF to monitor an OSP deployment with a high number of instances.
Known workaround
- A potential workaround would be to lower the polling interval, so that polling is more frequent and each metrics message is smaller. This has not been verified.
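For reference, the polling cadence in Ceilometer is configured in `polling.yaml`; a sketch of shortening the interval follows (the source name, interval value, and meter list are illustrative, and as noted above this workaround is unverified):

```yaml
---
sources:
  - name: instance_pollsters
    # Shorter interval: polls more often, so each batch of metrics
    # sent downstream should be smaller (illustrative value).
    interval: 60
    meters:
      - cpu
      - memory.usage
```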
Additional context
- Related bug https://bugzilla.redhat.com/show_bug.cgi?id=2016460
- The issue lies in the buffer size being hardwired to 65535. It was set to this value because UDP sockets have that limitation, but STF uses Unix sockets, which can handle a bigger buffer. We should raise the value to accommodate environments with more instances.
- …