OpenShift Logging / LOG-7347

Vector crashes with BufferTooSmall and enters repeated CrashLoopBackOff, causing high-impact log loss across the cluster (even with disk buffer enabled)


    • Incidents & Support
    • Release Note Text: Before this change, the collector would merge large application log messages without restriction, which could overflow an output buffer. This change introduces the MaxMessageSize tuning parameter for application logs to limit the size of messages and discard them if they exceed the configured threshold. (See the lookup sketch after this field list.)
    • Release Note Type: Bug Fix
    • Sprints: Log Collection - Sprint 273, Logging - Sprint 274, Logging - Sprint 275, Logging - Sprint 276, Logging - Sprint 277, Logging - Sprint 278, Logging - Sprint 279, Logging - Sprint 280, Logging - Sprint 281, Logging - Sprint 282, Logging - Sprint 283
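
      The release note above names a new MaxMessageSize tuning parameter for application logs, but does not show where it is exposed in the ClusterLogForwarder API. The following is a minimal lookup sketch, assuming the parameter surfaces under the application input's tuning stanza; the exact field name and location are assumptions, so rely on what "oc explain" reports for the installed CRD.

      # Sketch (assumed field location): list the tuning fields available on an
      # application input and look for the MaxMessageSize parameter among them.
      $ oc explain clusterlogforwarder.spec.inputs.application \
          --api-version=observability.openshift.io/v1 --recursive | grep -i -A 2 tuning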

      Description of problem:

      The Vector collector crashes with a BufferTooSmall error under certain log conditions and enters a repeated CrashLoopBackOff. This behavior results in high-impact log loss across the cluster, even though disk buffering with AtLeastOnce delivery mode is enabled.

      This issue is critical because:

      • It affects logs from unrelated pods, not just the one generating large or frequent log lines.
      • It can be triggered by non-cluster-admin users, increasing the operational risk.

      The following error messages were observed at the time of failure:

      2025-06-27T05:51:48.013419Z ERROR transform{component_kind="transform" component_id=output_external_rsyslog_app_parse_encoding component_type=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
      2025-06-27T05:51:48.013497Z ERROR transform{component_kind="transform" component_id=output_external_rsyslog_app_parse_encoding component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
      2025-06-27T05:51:48.014013Z INFO vector: Vector has stopped.
      

      Vector restarted immediately, but some application logs were lost even though the disk buffer (AtLeastOnce) had been enabled.

      Version-Release number of selected component (if applicable):

      • OpenShift 4.18.9
      • OpenShift Logging 6.2.3

      How reproducible:

      Frequently (with the reproduction steps below)

      Steps to Reproduce:

      • Step1. Define a ClusterLogForwarder object that sets "deliveryMode: AtLeastOnce".
      apiVersion: observability.openshift.io/v1
      kind: ClusterLogForwarder
      metadata:
        ...
      spec:
        collector:
          resources:
            limits:
              cpu: 1
              memory: 2Gi
            requests:
              cpu: 100m
              memory: 64Mi
        managementState: Managed
        outputs:
        - lokiStack:
            authentication:
              token:
                from: serviceAccount
            target:
              name: lokistack
              namespace: openshift-logging
            tuning:
              deliveryMode: AtLeastOnce
          name: default-lokistack
      ...
      
      • Step2. Pick one Ready worker node.
      node=$(oc get nodes -l node-role.kubernetes.io/worker= | grep " Ready " | awk '{print $1}' | head -n 1)
      
      • Step3. Create a namespace for the test, pinned to that node.
      $ oc adm new-project --node-selector="kubernetes.io/hostname=$node" test-logloss
      
      • Step4. Deploy an application which outputs numbers in order.
      $ oc run gen-numbers --image=image-registry.openshift-image-registry.svc:5000/openshift/cli -n test-logloss --  bash -c 'count=0; while :; do printf "%020d\n" "$count"; ((count=count+1)); sleep 0.1; done'
      
      • Step5. Check that the logs of the "gen-numbers" pod are stored in Loki.
      $ curl \
           -H "Authorization: Bearer $(oc create token -n openshift-kube-apiserver localhost-recovery-client --duration=10m)" \
           --get -sk \
           -H "Content-Type: application/json" \
           "https://$(oc get route -n openshift-logging lokistack -o jsonpath='{.spec.host}')/api/logs/v1/application/loki/api/v1/query_range" \
           --data-urlencode 'query={ log_type="application" } | json | kubernetes_container_name="gen-numbers" ' \
           --data-urlencode "limit=5000" \
         | jq -rM '.data.result[]?.values[][]' \
         | jq -rM '.message?' \
         | sort | awk 'BEGIN{pre=-1}{if (pre != -1) { print $1; if ($1 - pre != 1) {printf("DETECT: %d must be %d\n", $1, pre + 1)} }; pre=$1}'
      ...
      00000000000000005776
      00000000000000005777
      
      • Step6. Deploy an application that repeatedly outputs long log lines, which causes a Vector crash and log loss.
      $ oc run gen-longlines --image=image-registry.openshift-image-registry.svc:5000/openshift/cli -n test-logloss --  bash -c 'len=200k; while :; do echo "`base64 -w0 /dev/urandom | head -c $len`"; done'
      
      • Step7. Check that the collector pod enters a repeated CrashLoopBackOff.
      $ oc get pods -n openshift-logging -w
      ...
      collector-loki-299pm                        0/1     Completed   4 (4m29s ago)   19m
      collector-loki-299pm                        0/1     CrashLoopBackOff   4 (14s ago)     19m
      
      • Step8. In another terminal, run the following script, which detects log loss.
      #!/bin/bash
      
      # Query Loki for the gen-numbers logs and print any gap detected in the number sequence.
      function check()
      {
        date
        curl \
            -H "Authorization: Bearer $(oc create token -n openshift-kube-apiserver localhost-recovery-client --duration=10m)" \
            --get -sk \
            -H "Content-Type: application/json" \
            "https://$(oc get route -n openshift-logging lokistack -o jsonpath='{.spec.host}')/api/logs/v1/application/loki/api/v1/query_range" \
            --data-urlencode 'query={ log_type="application" } | json | kubernetes_container_name="gen-numbers" ' \
            --data-urlencode "limit=5000" \
          | jq -rM '.data.result[]?.values[][]' \
          | jq -rM '.message?' \
          | sort | awk 'BEGIN{pre=-1}{if (pre != -1) { print $1; if ($1 - pre != 1) {printf("DETECT: %d must be %d\n", $1, pre + 1)} }; pre=$1}' \
          | grep DETECT -B 10 -A 10
      }
      
      # Keep polling; require the gap to be seen twice (5 seconds apart) to allow Loki ingestion to catch up.
      while :
      do
       check && echo "Retry after 5 secs, since it might take a time to reflect all logs on loki..." && sleep 5 && check && break
       sleep 5
      done
      
      echo "The issue was reproduced." 
      
      • Step9. Stop the "gen-longlines" pod, then recheck whether the logs of the "gen-numbers" pod are stored in Loki.
      $ oc delete pods -n test-logloss gen-longlines
      
      • Step10. Wait until log loss is detected by the script from Step8. (A cleanup sketch follows after these steps.)
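
      Not part of the original procedure: a small cleanup sketch that removes the test workloads and namespace created in Step3, Step4, and Step6 once the reproduction is finished (names taken from those steps).

      # Cleanup: delete the generator pods and the test namespace.
      $ oc delete pod -n test-logloss gen-numbers gen-longlines --ignore-not-found
      $ oc delete project test-logloss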

      Actual results:

      Log loss is detected. In the run below, the logs from 5778 to 7221 are missing.

      Mon Jun 30 12:21:08 JST 2025
      00000000000000005769
      00000000000000005770
      00000000000000005771
      00000000000000005772
      00000000000000005773
      00000000000000005774
      00000000000000005775
      00000000000000005776
      00000000000000005777
      00000000000000007222
      DETECT: 7222 must be 5778
      00000000000000007223
      00000000000000007224
      00000000000000007225
      00000000000000007226
      00000000000000007227
      00000000000000007228
      00000000000000007229
      00000000000000007230
      00000000000000007231
      00000000000000007232
      The issue was reproduced.
      

      Vector exits with a BufferTooSmall error.

      $ oc logs -n openshift-logging -p collector-loki-299pm | grep BufferTooSmall -B 2 -A 10
      2025-06-30T03:17:10.391079Z  WARN vector::internal_events::file::source: Currently ignoring file too small to fingerprint. file=/var/log/pods/iperf_iperf-kl2qx_9b8923c9-8781-4337-a877-26409c523277/iperf3/0.log
      2025-06-30T03:17:23.281657Z ERROR transform{component_kind="transform" component_id=output_default_lokistack_application_remap_label component_type=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
      2025-06-30T03:17:23.281726Z ERROR transform{component_kind="transform" component_id=output_default_lokistack_application_remap_label component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
      2025-06-30T03:17:23.300447Z ERROR transform{component_kind="transform" component_id=output_default_lokistack_application_remap component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
      2025-06-30T03:17:23.460750Z ERROR transform{component_kind="transform" component_id=output_default_lokistack_route component_type=route}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
      2025-06-30T03:17:23.460997Z ERROR transform{component_kind="transform" component_id=pipeline_pipeline_lokistack_viaqdedot_2 component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
      

      Expected results:

      • Vector doesn't crash with BufferTooSmall.
      • Logs should not be lost.

      Additional info:

      It seems that Vector has no logic to prevent an overflow of its internal buffer.

      The internal buffer size of Vector is 128MB according to the source code [1].
      However, the batch "Events received. count=1000 byte_size=200245626" recorded at 2025-06-27T10:25:12.915594Z exceeds this size.

      This issue becomes more likely as the size of each log line increases.
      For example, with CloudWatch, Vector may send up to 1MB per line [2]; in that case, as few as count=128 events can cause the BufferTooSmall error.
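
      To make the arithmetic concrete, the following quick check compares the byte_size reported in the log line quoted above with the 128MB limit from [1] (taken here as 128 * 1024 * 1024 bytes; Vector's exact per-record accounting may differ slightly):

      # Rough check: one encoded batch of 1000 events at ~200KB per line vs.
      # Vector's 128MB disk buffer record limit.
      $ batch_bytes=200245626               # byte_size from the "Events received" log line
      $ limit_bytes=$((128 * 1024 * 1024))  # 134217728
      $ [ "$batch_bytes" -gt "$limit_bytes" ] && echo "batch exceeds the disk buffer record limit"
      batch exceeds the disk buffer record limit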

      [1] https://github.com/vectordotdev/vector/pull/13121/files#diff-d3541799ff241323f75c0800c291a695b527173ff046b2fd74697e044cc9bc61R13
      [2] https://access.redhat.com/solutions/7116651

              Vitalii Parfonov (vparfono)
              KATSUYA KAWAKAMI (kkawakam@redhat.com)
              Anping Li