OpenShift Logging / LOG-7347

[Logging6.2.3][Vector] Vector crashes with BufferTooSmall and enters repeated CrashLoopBackOff, causing high-impact log loss across the cluster (even with disk buffer enabled)


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: Logging 6.2.3
    • Component: Log Collection
    • Incidents & Support
    • NEW
    • Bug Fix
    • Sprints: Log Collection - Sprint 273, Logging - Sprint 274, Logging - Sprint 275, Logging - Sprint 276, Logging - Sprint 277, Logging - Sprint 278

      Description of problem:

      The Vector collector crashes with a BufferTooSmall error under certain log conditions and enters a repeated CrashLoopBackOff. This behavior results in high-impact log loss across the cluster, even though disk buffering with AtLeastOnce delivery mode is enabled.

      This issue is critical because:

      • It affects logs from unrelated pods, not just the one generating large or frequent log lines.
      • It can be triggered by non-cluster-admin users, increasing the operational risk.

      The following error messages were observed at the time of failure:

      2025-06-27T05:51:48.013419Z ERROR transform{component_kind="transform" component_id=output_external_rsyslog_app_parse_encoding component_type=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
      2025-06-27T05:51:48.013497Z ERROR transform{component_kind="transform" component_id=output_external_rsyslog_app_parse_encoding component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
      2025-06-27T05:51:48.014013Z INFO vector: Vector has stopped.
      

      Vector restarted immediately, but some application logs were lost even though the disk buffer (AtLeastOnce) was enabled.
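
      One way to confirm the crash and restarts from the cluster side is sketched below. The pod name (collector-loki-299pm) is the one observed later in this report; names will differ in other environments.

      # List collector pods with their restart counts, then print the last
      # termination reason of the crashed pod's container.
      $ oc get pods -n openshift-logging | grep '^collector'
      $ oc get pod -n openshift-logging collector-loki-299pm \
          -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'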

      Version-Release number of selected component (if applicable):

      • OpenShift 4.18.9
      • OpenShift Logging 6.2.3

      How reproducible:

      Frequently (with the reproduction steps below)

      Steps to Reproduce:

      • Step1. Define a ClusterLogForwarder object that sets "deliveryMode: AtLeastOnce" (a quick verification command follows the manifest).
      apiVersion: observability.openshift.io/v1
      kind: ClusterLogForwarder
      metadata:
        ...
      spec:
        collector:
          resources:
            limits:
              cpu: 1
              memory: 2Gi
            requests:
              cpu: 100m
              memory: 64Mi
        managementState: Managed
        outputs:
        - lokiStack:
            authentication:
              token:
                from: serviceAccount
            target:
              name: lokistack
              namespace: openshift-logging
            tuning:
              deliveryMode: AtLeastOnce
          name: default-lokistack
      ...
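
      After applying the manifest, the delivery mode can be confirmed in the stored spec. This is a sketch: the resource name is derived from the apiVersion/kind above, and the namespace is assumed to be openshift-logging (the object name itself is elided in this report).
      $ oc get clusterlogforwarders.observability.openshift.io -n openshift-logging -o yaml | grep -B3 deliveryMode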
      
      • Step2. Pick one of the Ready worker nodes.
      node=$(oc get nodes -l node-role.kubernetes.io/worker= | grep " Ready " | awk '{print $1}' | head -n 1)
      
      • Step3. Create a namespace for the test.
      $ oc adm new-project --node-selector="kubernetes.io/hostname=$node" test-logloss
      
      • Step4. Deploy an application which outputs numbers in order.
      $ oc run gen-numbers --image=image-registry.openshift-image-registry.svc:5000/openshift/cli -n test-logloss --  bash -c 'count=0; while :; do printf "%020d\n" "$count"; ((count=count+1)); sleep 0.1; done'
      
      • Step5. Check that the logs of the "gen-numbers" pod are stored in Loki.
      $ curl \
           -H "Authorization: Bearer $(oc create token -n openshift-kube-apiserver localhost-recovery-client --duration=10m)" \
           --get -sk \
           -H "Content-Type: application/json" \
           "https://$(oc get route -n openshift-logging lokistack -o jsonpath='\{.spec.host}')/api/logs/v1/application/loki/api/v1/query_range"\
           --data-urlencode 'query=\{ log_type="application" } | json | kubernetes_container_name="gen-numbers" ' \
           --data-urlencode "limit=5000" \
         | jq -rM '.data.result[]?.values[][]' \
         | jq -rM '.message?' \
         | sort | awk 'BEGIN{pre=-1}{if (pre != -1) { print $1; if ($1 - pre != 1) {printf("DETECT: %d must be %d\n", $1, pre + 1)} }; pre=$1}'
      ...
      00000000000000005776
      00000000000000005777
      
      • Step6. Deploy an application that repeatedly outputs very long log lines, which causes a Vector crash and log loss.
      $ oc run gen-longlines --image=image-registry.openshift-image-registry.svc:5000/openshift/cli -n test-logloss --  bash -c 'len=200k; while :; do echo "`base64 -w0 /dev/urandom | head -c $len`"; done'
      
      • Step7. Check that the collector pod enters a repeated CrashLoopBackOff.
      $ oc get pods -n openshift-logging -w
      ...
      collector-loki-299pm                        0/1     Completed   4 (4m29s ago)   19m
      collector-loki-299pm                        0/1     CrashLoopBackOff   4 (14s ago)     19m
      
      • Step8. In another terminal, run the following script, which detects log loss.
      #!/bin/bash
      
      function check()
      {
        date
        curl \
            -H "Authorization: Bearer $(oc create token -n openshift-kube-apiserver localhost-recovery-client --duration=10m)" \
            --get -sk \
            -H "Content-Type: application/json" \
            "https://$(oc get route -n openshift-logging lokistack -o jsonpath='\{.spec.host}')/api/logs/v1/application/loki/api/v1/query_range"\
            --data-urlencode 'query=\{ log_type="application" } | json | kubernetes_container_name="gen-numbers" ' \
            --data-urlencode "limit=5000" \
          | jq -rM '.data.result[]?.values[][]' \
          | jq -rM '.message?' \
          | sort | awk 'BEGIN{pre=-1}{if (pre != -1) { print $1; if ($1 - pre != 1) {printf("DETECT: %d must be %d\n", $1, pre + 1)} }; pre=$1}' \
          | grep DETECT -B 10 -A 10
      }
      
      while :
      do
       check && echo "Retry after 5 secs, since it might take some time for all logs to appear in Loki..." && sleep 5 && check && break
       sleep 5
      done
      
      echo "The issue was reproduced." 
      
      • Step9. Stop the "gen-longlines" pod, then recheck whether the logs of the "gen-numbers" pod are stored in Loki.
      $ oc delete pods -n test-logloss gen-longlines
      
      • Step10. Wait until log loss is detected.

      Actual results:

      Log loss is detected. In the case below, the logs from 5778 to 7221 are missing.

      Mon Jun 30 12:21:08 JST 2025
      00000000000000005769
      00000000000000005770
      00000000000000005771
      00000000000000005772
      00000000000000005773
      00000000000000005774
      00000000000000005775
      00000000000000005776
      00000000000000005777
      00000000000000007222
      DETECT: 7222 must be 5778
      00000000000000007223
      00000000000000007224
      00000000000000007225
      00000000000000007226
      00000000000000007227
      00000000000000007228
      00000000000000007229
      00000000000000007230
      00000000000000007231
      00000000000000007232
      The issue was reproduced.
      

      Vector exits with a BufferTooSmall error.

      $ oc logs -n openshift-logging -p collector-loki-299pm | grep BufferTooSmall -B 2 -A 10
      2025-06-30T03:17:10.391079Z  WARN vector::internal_events::file::source: Currently ignoring file too small to fingerprint. file=/var/log/pods/iperf_iperf-kl2qx_9b8923c9-8781-4337-a877-26409c523277/iperf3/0.log
      2025-06-30T03:17:23.281657Z ERROR transform{component_kind="transform" component_id=output_default_lokistack_application_remap_label component_type=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
      2025-06-30T03:17:23.281726Z ERROR transform{component_kind="transform" component_id=output_default_lokistack_application_remap_label component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
      2025-06-30T03:17:23.300447Z ERROR transform{component_kind="transform" component_id=output_default_lokistack_application_remap component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
      2025-06-30T03:17:23.460750Z ERROR transform{component_kind="transform" component_id=output_default_lokistack_route component_type=route}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
      2025-06-30T03:17:23.460997Z ERROR transform{component_kind="transform" component_id=pipeline_pipeline_lokistack_viaqdedot_2 component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
      

      Expected results:

      • Vector should not crash with BufferTooSmall.
      • Logs should not be lost.

      Additional info:

      It seems that Vector has no logic to prevent an overflow of its internal buffer.

      According to the source code [1], the internal buffer size of Vector is 128 MB.
      However, the batch "Events received. count=1000 byte_size=200245626" recorded at 2025-06-27T10:25:12.915594Z exceeds this size.

      This issue becomes more likely as the log size per line increases.
      For example, with CloudWatch, Vector may send up to 1 MB per line [2]; in that case, a batch of just count=128 can trigger the BufferTooSmall error.
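
      The back-of-the-envelope check below illustrates the sizes involved. The 128 MiB limit is taken from [1] and the byte_size value from the log line quoted above; it is only an arithmetic sketch, not Vector's actual code path.

      # Does one batch fit under the 128 MiB disk-buffer record limit?
      max_record=$((128 * 1024 * 1024))   # 134217728 bytes, per [1]
      observed=200245626                  # byte_size of the count=1000 batch above
      echo "observed batch exceeds limit: $((observed > max_record))"   # prints 1 (true)

      # CloudWatch case from [2]: at ~1 MiB per line, 128 lines already reach the limit,
      # so any encoding overhead pushes the record past it.
      echo "128 x 1 MiB = $((128 * 1024 * 1024)) bytes (equal to the limit)"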

      [1] https://github.com/vectordotdev/vector/pull/13121/files#diff-d3541799ff241323f75c0800c291a695b527173ff046b2fd74697e044cc9bc61R13
      [2] https://access.redhat.com/solutions/7116651

        1. gen-numbers.csv (10.88 MB, attached by Vitalii Parfonov)

              Assignee: Vitalii Parfonov (vparfono)
              Reporter: KATSUYA KAWAKAMI (kkawakam@redhat.com)