Bug
Resolution: Unresolved
Major
Logging 6.2.3
Incidents & Support
NEW
Bug Fix
Log Collection - Sprint 273, Logging - Sprint 274, Logging - Sprint 275, Logging - Sprint 276, Logging - Sprint 277, Logging - Sprint 278
Description of problem:
The Vector collector crashes with a BufferTooSmall error under certain log conditions and enters a repeated CrashLoopBackOff. This behavior results in high-impact log loss across the cluster, even though disk buffering with AtLeastOnce delivery mode is enabled.
This issue is critical because:
- It affects logs from unrelated pods, not just the one generating large or frequent log lines.
- It can be triggered by non-cluster-admin users, increasing the operational risk.
The following error messages were observed at the time of failure:
2025-06-27T05:51:48.013419Z ERROR transform{component_kind="transform" component_id=output_external_rsyslog_app_parse_encoding component_type=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
2025-06-27T05:51:48.013497Z ERROR transform{component_kind="transform" component_id=output_external_rsyslog_app_parse_encoding component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
2025-06-27T05:51:48.014013Z INFO vector: Vector has stopped.
Vector restarted immediately, but some application logs were lost even though the disk buffer (AtLeastOnce) was enabled.
Version-Release number of selected component (if applicable):
- OpenShift 4.18.9
- OpenShift Logging 6.2.3
How reproducible:
Frequently (with our reproduction steps)
Steps to Reproduce:
- Step1. Define a ClusterLogForwarder object that sets "deliveryMode: AtLeastOnce".
apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  ...
spec:
  collector:
    resources:
      limits:
        cpu: 1
        memory: 2Gi
      requests:
        cpu: 100m
        memory: 64Mi
  managementState: Managed
  outputs:
  - lokiStack:
      authentication:
        token:
          from: serviceAccount
      target:
        name: lokistack
        namespace: openshift-logging
      tuning:
        deliveryMode: AtLeastOnce
    name: default-lokistack
  ...
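To apply the forwarder and confirm the collector pods roll out, something like the following can be used (the file name clf.yaml is only illustrative):
$ oc apply -f clf.yaml
$ oc get pods -n openshift-logging | grep collector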
- Step2. Pick one of the Ready worker nodes.
$ node=$(oc get nodes -l node-role.kubernetes.io/worker= | grep " Ready " | awk '{print $1}' | head -n 1)
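As an optional sanity check, confirm that a node was actually selected and is Ready:
$ echo "$node"
$ oc get node "$node"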
- Step3. Create a namespace for the test.
$ oc adm new-project --node-selector="kubernetes.io/hostname=$node" test-logloss
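Optionally, verify that the node selector was set on the new project so that the test pods land on the chosen node:
$ oc get namespace test-logloss -o yaml | grep node-selector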
- Step4. Deploy an application which outputs numbers in order.
$ oc run gen-numbers --image=image-registry.openshift-image-registry.svc:5000/openshift/cli -n test-logloss -- bash -c 'count=0; while :; do printf "%020d\n" "$count"; ((count=count+1)); sleep 0.1; done'
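An optional quick check that the pod is emitting the expected sequence:
$ oc logs -n test-logloss gen-numbers --tail=5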
- Step5. Check if the logs of the "gen-numbers" pod are saved in Loki.
$ curl \
  -H "Authorization: Bearer $(oc create token -n openshift-kube-apiserver localhost-recovery-client --duration=10m)" \
  --get -sk \
  -H "Content-Type: application/json" \
  "https://$(oc get route -n openshift-logging lokistack -o jsonpath='{.spec.host}')/api/logs/v1/application/loki/api/v1/query_range" \
  --data-urlencode 'query={ log_type="application" } | json | kubernetes_container_name="gen-numbers" ' \
  --data-urlencode "limit=5000" \
  | jq -rM '.data.result[]?.values[][]' \
  | jq -rM '.message?' \
  | sort \
  | awk 'BEGIN{pre=-1}{if (pre != -1) { print $1; if ($1 - pre != 1) {printf("DETECT: %d must be %d\n", $1, pre + 1)} }; pre=$1}'
...
00000000000000005776
00000000000000005777
- Step6. Deploy an application that repeatedly outputs long log lines, which causes the Vector crash and log loss.
$ oc run gen-longlines --image=image-registry.openshift-image-registry.svc:5000/openshift/cli -n test-logloss -- bash -c 'len=200k; while :; do echo "`base64 -w0 /dev/urandom | head -c $len`"; done'
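Optionally, confirm that each emitted line is roughly 200KiB:
$ oc logs -n test-logloss gen-longlines --tail=1 | wc -c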
- Step7. Check if the collector pod enters a repeated CrashLoopBackOff.
$ oc get pods -n openshift-logging -w
...
collector-loki-299pm   0/1   Completed          4 (4m29s ago)   19m
collector-loki-299pm   0/1   CrashLoopBackOff   4 (14s ago)     19m
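As an optional check, confirm that the crashing collector pod is the one scheduled on the selected node:
$ oc get pods -n openshift-logging -o wide | grep "$node"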
- Step8. In another terminal, run the following script, which detects log loss.
#!/bin/bash
function check() {
  date
  curl \
    -H "Authorization: Bearer $(oc create token -n openshift-kube-apiserver localhost-recovery-client --duration=10m)" \
    --get -sk \
    -H "Content-Type: application/json" \
    "https://$(oc get route -n openshift-logging lokistack -o jsonpath='{.spec.host}')/api/logs/v1/application/loki/api/v1/query_range" \
    --data-urlencode 'query={ log_type="application" } | json | kubernetes_container_name="gen-numbers" ' \
    --data-urlencode "limit=5000" \
    | jq -rM '.data.result[]?.values[][]' \
    | jq -rM '.message?' \
    | sort \
    | awk 'BEGIN{pre=-1}{if (pre != -1) { print $1; if ($1 - pre != 1) {printf("DETECT: %d must be %d\n", $1, pre + 1)} }; pre=$1}' \
    | grep DETECT -B 10 -A 10
}
while :
do
  check && echo "Retry after 5 secs, since it might take a time to reflect all logs on loki..." && sleep 5 && check && break
  sleep 5
done
echo "The issue was reproduced."
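For example, the script can be saved and run as follows (the file name is arbitrary); it keeps polling and only reports the issue after "DETECT" appears in two consecutive queries, to give Loki time to ingest all logs:
$ bash check-logloss.sh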
- Step9. Stop the "gen-longlines" pod, then recheck (as in Step5) whether the logs of the "gen-numbers" pod are saved in Loki.
$ oc delete pods -n test-logloss gen-longlines
- Step10. Wait until log loss is detected by the script from Step8.
Actual results:
Log loss is detected. The logs from 5778 to 7221 are missing in the following case.
Mon Jun 30 12:21:08 JST 2025
00000000000000005769
00000000000000005770
00000000000000005771
00000000000000005772
00000000000000005773
00000000000000005774
00000000000000005775
00000000000000005776
00000000000000005777
00000000000000007222
DETECT: 7222 must be 5778
00000000000000007223
00000000000000007224
00000000000000007225
00000000000000007226
00000000000000007227
00000000000000007228
00000000000000007229
00000000000000007230
00000000000000007231
00000000000000007232
The issue was reproduced.
Vector exits with a BufferTooSmall error.
$ oc logs -n openshift-logging -p collector-loki-299pm | grep BufferTooSmall -B 2 -A 10
2025-06-30T03:17:10.391079Z WARN vector::internal_events::file::source: Currently ignoring file too small to fingerprint. file=/var/log/pods/iperf_iperf-kl2qx_9b8923c9-8781-4337-a877-26409c523277/iperf3/0.log
2025-06-30T03:17:23.281657Z ERROR transform{component_kind="transform" component_id=output_default_lokistack_application_remap_label component_type=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
2025-06-30T03:17:23.281726Z ERROR transform{component_kind="transform" component_id=output_default_lokistack_application_remap_label component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
2025-06-30T03:17:23.300447Z ERROR transform{component_kind="transform" component_id=output_default_lokistack_application_remap component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
2025-06-30T03:17:23.460750Z ERROR transform{component_kind="transform" component_id=output_default_lokistack_route component_type=route}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
2025-06-30T03:17:23.460997Z ERROR transform{component_kind="transform" component_id=pipeline_pipeline_lokistack_viaqdedot_2 component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
Expected results:
- Vector doesn't crash with BufferTooSmall.
- Logs should not be lost.
Additional info:
It seems that Vector has no logic to prevent an overflow of its internal buffer.
The internal buffer size of Vector is 128MB according to the source code [1].
However, the batch logged as "Events received. count=1000 byte_size=200245626" at 2025-06-27T10:25:12.915594Z exceeds this size.
This issue becomes more likely as the log size per line increases.
For example, Vector might send up to 1MB per line to CloudWatch [2]. In that case, just count=128 can already cause the BufferTooSmall error; see the rough calculation below.
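A rough calculation, assuming the 128MB limit in [1] means 128MiB and using the byte_size value quoted above:
$ max=$((128 * 1024 * 1024))            # assumed limit from [1]: 134217728 bytes
$ echo $(( 200245626 > max ))           # prints 1: the observed 1000-event batch exceeds the limit
$ echo $(( 128 * 1024 * 1024 >= max ))  # prints 1: 128 events of 1MiB each already total the full limit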
[1] https://github.com/vectordotdev/vector/pull/13121/files#diff-d3541799ff241323f75c0800c291a695b527173ff046b2fd74697e044cc9bc61R13
[2] https://access.redhat.com/solutions/7116651