OpenShift Logging / LOG-7347

Vector crashes with BufferTooSmall and enters repeated CrashLoopBackOff, causing high-impact log loss across the cluster (even with disk buffer enabled)


    • Incidents & Support
    • Release Note Text: Before this change, the collector would merge large application log messages without restriction, which could overflow an output buffer. This change introduces the MaxMessageSize tuning parameter for application logs to limit the size of messages and discard them if they exceed the configured threshold. (See the lookup sketch after this field list.)
    • Release Note Type: Bug Fix
    • Sprints: Log Collection - Sprint 273, Logging - Sprint 274, Logging - Sprint 275, Logging - Sprint 276, Logging - Sprint 277, Logging - Sprint 278, Logging - Sprint 279, Logging - Sprint 280, Logging - Sprint 281, Logging - Sprint 282, Logging - Sprint 283
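
      The release note above names a new MaxMessageSize tuning parameter for application logs, but does not show where it is exposed in the ClusterLogForwarder API. The following is a minimal lookup sketch, assuming the parameter surfaces under the application input's tuning stanza; the exact field name and location are assumptions, so rely on what "oc explain" reports for the installed CRD.

      # Sketch (assumed field location): list the tuning fields available on an
      # application input and look for the MaxMessageSize parameter among them.
      $ oc explain clusterlogforwarder.spec.inputs.application \
          --api-version=observability.openshift.io/v1 --recursive | grep -i -A 2 tuning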

      Description of problem:

      The Vector collector crashes with a BufferTooSmall error under certain log conditions and enters a repeated CrashLoopBackOff. This behavior results in high-impact log loss across the cluster, even though disk buffering with AtLeastOnce delivery mode is enabled.

      This issue is critical because:

      • It affects logs from unrelated pods, not just the one generating large or frequent log lines.
      • It can be triggered by non-cluster-admin users, increasing the operational risk.

      The following error messages were observed at the time of failure:

      2025-06-27T05:51:48.013419Z ERROR transform{component_kind="transform" component_id=output_external_rsyslog_app_parse_encoding component_type=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
      2025-06-27T05:51:48.013497Z ERROR transform{component_kind="transform" component_id=output_external_rsyslog_app_parse_encoding component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
      2025-06-27T05:51:48.014013Z INFO vector: Vector has stopped.
      

      Vector restarted immediately, but some application logs were lost even though the disk buffer (AtLeastOnce) had been enabled.

      Version-Release number of selected component (if applicable):

      • OpenShift 4.18.9
      • OpenShift Logging 6.2.3

      How reproducible:

      Frequently (with the reproduction steps below)

      Steps to Reproduce:

      • Step1. Define a ClusterLogForwarder object that sets "deliveryMode: AtLeastOnce".
      apiVersion: observability.openshift.io/v1
      kind: ClusterLogForwarder
      metadata:
        ...
      spec:
        collector:
          resources:
            limits:
              cpu: 1
              memory: 2Gi
            requests:
              cpu: 100m
              memory: 64Mi
        managementState: Managed
        outputs:
        - lokiStack:
            authentication:
              token:
                from: serviceAccount
            target:
              name: lokistack
              namespace: openshift-logging
            tuning:
              deliveryMode: AtLeastOnce
          name: default-lokistack
      ...
      
      • Step2. Pick one Ready worker node.
      node=$(oc get nodes -l node-role.kubernetes.io/worker= | grep " Ready " | awk '{print $1}' | head -n 1)
      
      • Step3. Create a namespace for the test, pinned to that node.
      $ oc adm new-project --node-selector="kubernetes.io/hostname=$node" test-logloss
      
      • Step4. Deploy an application which outputs numbers in order.
      $ oc run gen-numbers --image=image-registry.openshift-image-registry.svc:5000/openshift/cli -n test-logloss --  bash -c 'count=0; while :; do printf "%020d\n" "$count"; ((count=count+1)); sleep 0.1; done'
      
      • Step5. Check that the logs of the "gen-numbers" pod are stored in Loki.
      $ curl \
           -H "Authorization: Bearer $(oc create token -n openshift-kube-apiserver localhost-recovery-client --duration=10m)" \
           --get -sk \
           -H "Content-Type: application/json" \
           "https://$(oc get route -n openshift-logging lokistack -o jsonpath='{.spec.host}')/api/logs/v1/application/loki/api/v1/query_range" \
           --data-urlencode 'query={ log_type="application" } | json | kubernetes_container_name="gen-numbers" ' \
           --data-urlencode "limit=5000" \
         | jq -rM '.data.result[]?.values[][]' \
         | jq -rM '.message?' \
         | sort | awk 'BEGIN{pre=-1}{if (pre != -1) { print $1; if ($1 - pre != 1) {printf("DETECT: %d must be %d\n", $1, pre + 1)} }; pre=$1}'
      ...
      00000000000000005776
      00000000000000005777
      
      • Step6. Deploy an application that repeatedly outputs long log lines, which causes a Vector crash and log loss.
      $ oc run gen-longlines --image=image-registry.openshift-image-registry.svc:5000/openshift/cli -n test-logloss --  bash -c 'len=200k; while :; do echo "`base64 -w0 /dev/urandom | head -c $len`"; done'
      
      • Step7. Check that the collector pod enters a repeated CrashLoopBackOff.
      $ oc get pods -n openshift-logging -w
      ...
      collector-loki-299pm                        0/1     Completed   4 (4m29s ago)   19m
      collector-loki-299pm                        0/1     CrashLoopBackOff   4 (14s ago)     19m
      
      • Step8. In another terminal, run the following script, which detects log loss.
      #!/bin/bash
      
      # Query Loki for the gen-numbers logs and print any gap detected in the number sequence.
      function check()
      {
        date
        curl \
            -H "Authorization: Bearer $(oc create token -n openshift-kube-apiserver localhost-recovery-client --duration=10m)" \
            --get -sk \
            -H "Content-Type: application/json" \
            "https://$(oc get route -n openshift-logging lokistack -o jsonpath='{.spec.host}')/api/logs/v1/application/loki/api/v1/query_range" \
            --data-urlencode 'query={ log_type="application" } | json | kubernetes_container_name="gen-numbers" ' \
            --data-urlencode "limit=5000" \
          | jq -rM '.data.result[]?.values[][]' \
          | jq -rM '.message?' \
          | sort | awk 'BEGIN{pre=-1}{if (pre != -1) { print $1; if ($1 - pre != 1) {printf("DETECT: %d must be %d\n", $1, pre + 1)} }; pre=$1}' \
          | grep DETECT -B 10 -A 10
      }
      
      # Keep polling; require the gap to be seen twice (5 seconds apart) to allow Loki ingestion to catch up.
      while :
      do
       check && echo "Retry after 5 secs, since it might take a time to reflect all logs on loki..." && sleep 5 && check && break
       sleep 5
      done
      
      echo "The issue was reproduced." 
      
      • Step9. Stop the "gen-longlines" pod, then recheck whether the logs of the "gen-numbers" pod are stored in Loki.
      $ oc delete pods -n test-logloss gen-longlines
      
      • Step10. Wait until log loss is detected by the script from Step8. (A cleanup sketch follows after these steps.)
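
      Not part of the original procedure: a small cleanup sketch that removes the test workloads and namespace created in Step3, Step4, and Step6 once the reproduction is finished (names taken from those steps).

      # Cleanup: delete the generator pods and the test namespace.
      $ oc delete pod -n test-logloss gen-numbers gen-longlines --ignore-not-found
      $ oc delete project test-logloss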

      Actual results:

      Log loss is detected. In the run below, the logs from 5778 to 7221 are missing.

      Mon Jun 30 12:21:08 JST 2025
      00000000000000005769
      00000000000000005770
      00000000000000005771
      00000000000000005772
      00000000000000005773
      00000000000000005774
      00000000000000005775
      00000000000000005776
      00000000000000005777
      00000000000000007222
      DETECT: 7222 must be 5778
      00000000000000007223
      00000000000000007224
      00000000000000007225
      00000000000000007226
      00000000000000007227
      00000000000000007228
      00000000000000007229
      00000000000000007230
      00000000000000007231
      00000000000000007232
      The issue was reproduced.
      

      Vector exits with a BufferTooSmall error.

      $ oc logs -n openshift-logging -p collector-loki-299pm | grep BufferTooSmall -B 2 -A 10
      2025-06-30T03:17:10.391079Z  WARN vector::internal_events::file::source: Currently ignoring file too small to fingerprint. file=/var/log/pods/iperf_iperf-kl2qx_9b8923c9-8781-4337-a877-26409c523277/iperf3/0.log
      2025-06-30T03:17:23.281657Z ERROR transform{component_kind="transform" component_id=output_default_lokistack_application_remap_label component_type=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
      2025-06-30T03:17:23.281726Z ERROR transform{component_kind="transform" component_id=output_default_lokistack_application_remap_label component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
      2025-06-30T03:17:23.300447Z ERROR transform{component_kind="transform" component_id=output_default_lokistack_application_remap component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
      2025-06-30T03:17:23.460750Z ERROR transform{component_kind="transform" component_id=output_default_lokistack_route component_type=route}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
      2025-06-30T03:17:23.460997Z ERROR transform{component_kind="transform" component_id=pipeline_pipeline_lokistack_viaqdedot_2 component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
      

      Expected results:

      • Vector doesn't crash with BufferTooSmall.
      • Logs should not be lost.

      Additional info:

      It seems that Vector has no logic to prevent an overflow of its internal buffer.

      The internal buffer size of Vector is 128MB according to the source code [1].
      However, the batch "Events received. count=1000 byte_size=200245626" recorded at 2025-06-27T10:25:12.915594Z exceeds this size.

      This issue becomes more likely as the size of each log line increases.
      For example, with CloudWatch, Vector may send up to 1MB per line [2]; in that case, as few as count=128 events can cause the BufferTooSmall error.
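
      To make the arithmetic concrete, the following quick check compares the byte_size reported in the log line quoted above with the 128MB limit from [1] (taken here as 128 * 1024 * 1024 bytes; Vector's exact per-record accounting may differ slightly):

      # Rough check: one encoded batch of 1000 events at ~200KB per line vs.
      # Vector's 128MB disk buffer record limit.
      $ batch_bytes=200245626               # byte_size from the "Events received" log line
      $ limit_bytes=$((128 * 1024 * 1024))  # 134217728
      $ [ "$batch_bytes" -gt "$limit_bytes" ] && echo "batch exceeds the disk buffer record limit"
      batch exceeds the disk buffer record limit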

      [1] https://github.com/vectordotdev/vector/pull/13121/files#diff-d3541799ff241323f75c0800c291a695b527173ff046b2fd74697e044cc9bc61R13
      [2] https://access.redhat.com/solutions/7116651

              Vitalii Parfonov (vparfono)
              KATSUYA KAWAKAMI (kkawakam@redhat.com)
              Anping Li