Red Hat OpenStack Services on OpenShift
OSPRH-23826

sg-core fails to handle long messages


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • stf-1.5.7
    • stf-1.5.6
    • sg-core
    • rhos-observability-telemetry
      .The `sg-core` container buffer is now sized dynamically to ensure the metrics from long messages are processed

      Before this update, the buffer size of the `sg-core` container was too small to process long messages generated by large scale deployments, which caused metrics to be lost. This update implements a dynamic buffer size that grows to handle higher loads, and a higher limit that matches the capabilities of the Unix socket implementations used by STF. The long messages and metrics generated by large scale deployments are now processed correctly.
    • Bug Fix
    • Bug Delivery Tracker, Observability Sprint 2026 1, Observability Sprint 2025 EOY
    • 3
    • Important

To Reproduce

Steps to reproduce the behavior:

      1. In an environment with a high number of instances (we currently hit this mark in a deployment with ~100 Nova instances), configure STF to monitor those instances
      2. Observe that the real number of instances differs from the number of instances reported by Prometheus
      3. Turn on debug logs for sg-core
      4. See errors such as:

      2025-12-18 04:07:31 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|3d3f\"}], |..., bigger context ...|ba54a4ed42d3ecd63573aa1a00faacabecf720123d3f\"}], |..., handler: ceilometer-metrics[dummy-metrics0]]
      2025-12-18 04:07:32 [DEBUG] failed handling message [handler: ceilometer-metrics[dummy-metrics0], error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|3d3f\"}], |..., bigger context ...|ba54a4ed42d3ecd63573aa1a00faacabecf720123d3f\"}], |...]

      and

      2025-12-18 05:08:15 [WARN] full read buffer used [plugin: socket]
      2025-12-18 05:08:15 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|4ed42d3ecd|..., bigger context ...|signature\": \"2acacbd9c515743bdde5ba54a4ed42d3ecd|..., handler: ceilometer-metrics[socket0]]
      2025-12-18 05:08:16 [DEBUG] receiving 1 msg/s [plugin: socket]
      2025-12-18 05:08:16 [WARN] full read buffer used [plugin: socket]
      2025-12-18 05:08:16 [DEBUG] failed handling message [error: ceilometer.OsloSchema.Request: OsloMessage: readStringSlowPath: unexpected end of input, error found in #10 byte of ...|4ed42d3ecd|..., bigger context ...|signature\": \"2acacbd9c515743bdde5ba54a4ed42d3ecd|..., handler: ceilometer-metrics[socket0]]
      2025-12-18 05:08:17 [DEBUG] receiving 1 msg/s [plugin: socket]

      Note the sg-core internal metrics: messages are received, but neither decode_count nor decode_error_count was increased.

      Processed metric:
      {
        "Name": "sg_total_ceilometer_msg_received_count",
        "Time": 0,
        "Type": 1,
        "Interval": 0,
        "Value": 4,
        "LabelKeys": [ "source" ],
        "LabelVals": [ "SG" ]
      }
      Processed metric:
      {
        "Name": "sg_total_ceilometer_metric_decode_count",
        "Time": 0,
        "Type": 1,
        "Interval": 0,
        "Value": 0,
        "LabelKeys": [ "source" ],
        "LabelVals": [ "SG" ]
      }
      Processed metric:
      {
        "Name": "sg_total_ceilometer_metric_decode_error_count",
        "Time": 0,
        "Type": 1,
        "Interval": 0,
        "Value": 0,
        "LabelKeys": [ "source" ],
        "LabelVals": [ "SG" ]
      }

      Expected behavior

      • STF should report the correct number of instances

      Bug impact

      • This impacts all users who use STF to monitor OSP deployments with a high number of instances.

      Known workaround

      • A potential workaround is to lower the polling interval so that polling happens more frequently and each message carrying metrics is shorter. This has not been verified.
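      The unverified workaround above would be applied in Ceilometer's polling.yaml. A sketch of what that could look like (the source name and meter list are illustrative, not taken from this deployment):

      ```yaml
      # Ceilometer polling.yaml sketch: a shorter interval means each
      # polling cycle emits a smaller batch of samples per message.
      # Source name and meters below are illustrative placeholders.
      sources:
          - name: instance_pollsters
            interval: 60        # seconds; poll more often so messages stay shorter
            meters:
              - cpu
              - memory.usage
      ```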

      Additional context

      • Related bug https://bugzilla.redhat.com/show_bug.cgi?id=2016460
      • The issue lies in the buffer size being hardwired to 65535 bytes. That value was chosen because UDP sockets have this limitation, but STF uses Unix sockets, which can handle a larger buffer. We should raise the value to accommodate environments with more instances.
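      The "full read buffer used" warning is the signal that a datagram may have been truncated. A minimal sketch of the growth strategy described above, in Go (sg-core's language); the constants and function name are illustrative, not sg-core's actual code:

      ```go
      package main

      import "fmt"

      // Illustrative constants: start at the old hardwired UDP-derived limit,
      // cap growth at a value Unix sockets can comfortably carry.
      const initialBufSize = 65535
      const maxBufSize = 16 * 1024 * 1024

      // nextBufSize doubles the buffer when a read fills it completely,
      // which indicates the datagram may have been truncated.
      func nextBufSize(cur, bytesRead int) int {
          if bytesRead < cur || cur >= maxBufSize {
              return cur // buffer was large enough, or already at the cap
          }
          next := cur * 2
          if next > maxBufSize {
              next = maxBufSize
          }
          return next
      }

      func main() {
          buf := make([]byte, initialBufSize)
          // Simulate a read that used the whole buffer ("full read buffer used"):
          n := len(buf)
          if grown := nextBufSize(len(buf), n); grown > len(buf) {
              buf = make([]byte, grown)
          }
          fmt.Println(len(buf)) // 131070
      }
      ```

      The key design point is that growth happens only on a completely filled read, so small deployments keep the small buffer while large-scale deployments converge on a size that fits their messages.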

              rhn-engineering-vimartin Victoria Martinez de la Cruz