OpenShift Logging / LOG-7873

[Logging 6.2][Vector] Log forwarding stops after OOM; disk buffer file corruption causes deserialization error

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version: Logging 6.2.z
    • Component: Log Collection
    • Incidents & Support
    • Release Note Type: Bug Fix

      Description of problem:

      After the collector pod is OOM-killed and restarts, the pod reports Running, but no logs are sent.
      The collector logs the error "Last written record was unable to be deserialized", because the on-disk buffer file was corrupted by the abrupt termination.
      This blocks log forwarding indefinitely, which is a resilience gap in the OpenShift Logging system:
      the system looks healthy from the outside, but in fact no logs reach the destination.
      As a result, monitoring tools can miss important events, and troubleshooting becomes very hard.

      This issue is confirmed upstream in https://github.com/vectordotdev/vector/pull/23617.
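
      On a live cluster, one way to check whether a collector pod is stuck in this state is to search its logs for the deserialization error. A minimal sketch, assuming the openshift-logging namespace and the standard collector component label (both may differ per deployment):

      $ for pod in $(oc -n openshift-logging get pods -l app.kubernetes.io/component=collector -o name); do
          oc -n openshift-logging logs "$pod" | grep -q "Last written record was unable to be deserialized" \
            && echo "$pod appears stuck on a corrupted disk buffer"
        done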

      Version-Release number of selected component (if applicable):

      OpenShift 4.18.9
      OpenShift Logging 6.2.5

      How reproducible:

      Always

      Steps to Reproduce:

      We’ve created a simplified reproduction procedure that doesn’t require deploying an OpenShift cluster.
      It defines a vector configuration equivalent to CLO’s AtLeastOnce delivery option, which is the key point of the reproduction.
      The same behavior was observed on OpenShift as well, but this minimal setup reproduces the issue without the complexity of a full cluster.
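
      For context, the disk buffer in the config below corresponds to what CLO generates when an output uses AtLeastOnce delivery; on a real cluster you can confirm the mode on the ClusterLogForwarder (a sketch; the resource name is a placeholder):

      $ oc -n openshift-logging get clusterlogforwarder <name> -o yaml | grep deliveryMode
            deliveryMode: AtLeastOnce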

      • Step 1. Create two directories.
        $ mkdir -p conf data
        
      • Step 2. Create conf/vector.yaml with the following contents.
        data_dir: /data
        sources:
          events:
            type: file
            include:
              - /data/*.json
            ignore_checkpoints: true
            fingerprinting:
              strategy: device_and_inode
            max_line_bytes: 524288000
        sinks:
          console:
            type: console
            inputs:
              - split_events
            buffer:
              type: disk
              max_size: 268435488
              when_full: block
            encoding:
              codec: text
            target: stdout
        transforms:
          split_events:
            type: remap
            inputs:
              - events
            source: |-
              . = parse_json!(.message)
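
        Optionally, sanity-check the config with vector's built-in validator before running it (a sketch using the same image as the later steps; keep both volumes mounted so the data_dir and sources can be checked):

        $ podman run --rm -v `pwd`/conf:/etc/vector:Z -v `pwd`/data:/data:Z --entrypoint vector docker://registry.redhat.io/openshift-logging/vector-rhel9@sha256:c76c853565fc93da88ba0116425839cfb86cab63cbde9856d7cca78278d7758a validate /etc/vector/vector.yaml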
        
      • Step 3. Generate sample JSON files, each line carrying a ~1MB message.
        $ python3 << _EOF_
        import json
        import os
        
        # Directory to store the test files
        output_dir = "data" 
        os.makedirs(output_dir, exist_ok=True)
        
        # Create a single large event (~1MB)
        nested_json = {
            "user": "test_user",
            "action": "upload",
            "metadata": {
                "filename": "example.txt",
                "size": "1MB",
                "tags": ["test", "vector", "buffer"],
                "description": "A" * (1 * 1024 * 1024 - 200)
            }
        }
        event = {"message": json.dumps(nested_json)}
        
        # Serialize the event to a string
        event_str = json.dumps(event) + "\n" 
        event_size = len(event_str.encode("utf-8"))
        
        # Calculate how many events to reach ~50MB
        target_size = 50 * 1024 * 1024  # 50MB
        num_events = target_size // event_size
        
        # Create 3 files with repeated events
        for i in range(3):
            file_path = os.path.join(output_dir, f"test_events_{i+1}.json")
            with open(file_path, "w") as f:
                for _ in range(num_events):
                    f.write(event_str)
        
        print(f"Created 3 test files in {output_dir}, each approximately 50MB.")
        _EOF_
        
      • Step 4. Check that the 3 JSON files were created.
        $ ls -l data
        -rw-r--r--. 1 test test 52428800 Oct 14 14:16 test_events_1.json
        -rw-r--r--. 1 test test 52428800 Oct 14 14:16 test_events_2.json
        -rw-r--r--. 1 test test 52428800 Oct 14 14:16 test_events_3.json
        
      • Step 5. Pull Logging 6.2.5's vector image.
        $ podman pull registry.redhat.io/openshift-logging/vector-rhel9@sha256:c76c853565fc93da88ba0116425839cfb86cab63cbde9856d7cca78278d7758a
        
      • Step 6. Run vector once to generate its disk buffer files.
        $ rm -rf data/buffer/ data/org data/events/
        $ timeout 15 podman run --pull always --rm -v `pwd`/conf:/etc/vector:Z -v `pwd`/data:/data:Z --entrypoint vector docker://registry.redhat.io/openshift-logging/vector-rhel9@sha256:c76c853565fc93da88ba0116425839cfb86cab63cbde9856d7cca78278d7758a --config /etc/vector/vector.yaml -vvv | cut -c1-180
        ...
        2025-10-14T05:21:27.714740Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="split_events, console" time_remaining="59 seconds left" 
        2025-10-14T05:21:27.724998Z DEBUG transform{component_kind="transform" component_id=split_events component_type=remap}: vector::topology::builder: Synchronous transform finished normally.
        2025-10-14T05:21:27.731238Z DEBUG sink{component_kind="sink" component_id=console component_type=console}: vector::topology::builder: Sink finished normally.
        Terminated
        
      • Step 7. Check that buffer-data-*.dat exists.
        $ ls -l data/buffer/v2/console/
        total 110600
        -rw-r-----. 1 test test 113247760 Oct 14 14:21 buffer-data-1.dat
        -rw-r-----. 1 test test        24 Oct 14 14:21 buffer.db
        -rw-r--r--. 1 test test         0 Oct 14 14:21 buffer.lock
        
      • Step 8. Truncate buffer-data-*.dat to simulate buffer data corruption.
        $ cp -pR data/buffer/v2/console/ data/buffer/v2/console.corrupt
        $ truncate -s 113247000 data/buffer/v2/console.corrupt/buffer-data-1.dat
        $ ls -l data/buffer/v2/console.corrupt/
        total 110600
        -rw-r-----. 1 test test 113247000 Oct 14 14:22 buffer-data-1.dat
        -rw-r-----. 1 test test        24 Oct 14 14:21 buffer.db
        -rw-r--r--. 1 test test         0 Oct 14 14:21 buffer.lock
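
        For reference, this truncation cuts 760 bytes (113247760 - 113247000) off the tail of the data file, leaving the last written record incomplete, which is the same effect a partial write at OOM-kill time has. To confirm the difference:

        $ echo $(( $(stat -c %s data/buffer/v2/console/buffer-data-1.dat) - $(stat -c %s data/buffer/v2/console.corrupt/buffer-data-1.dat) ))
        760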
        
      • Step 9. Run the vector image against the corrupted buffer, simulating the state left behind by an OOM kill.
        Test with Logging 6.2.5's vector image:
        
        $ rm -rf data/buffer/v2/console
        $ cp -pR data/buffer/v2/console.corrupt data/buffer/v2/console
        $ timeout 15 podman run --pull always --rm -v `pwd`/conf:/etc/vector:Z -v `pwd`/data:/data:Z --entrypoint vector docker://registry.redhat.io/openshift-logging/vector-rhel9@sha256:c76c853565fc93da88ba0116425839cfb86cab63cbde9856d7cca78278d7758a --config /etc/vector/vector.yaml -vvv | cut -c1-180
        

      Actual results:

      • The collector prints errors and gets stuck. We see: "Last written record was unable to be deserialized."
      • It also logs “Incremented acknowledged reader file ID …” over and over in a loop.
      • Log forwarding stops:
        • the pod shows Running, but the destination log file does not change for 15+ minutes.
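
      To confirm the loop, you can count the repeated message in the Step 9 output; the count keeps growing while no events are delivered:

        $ timeout 60 podman run --rm -v `pwd`/conf:/etc/vector:Z -v `pwd`/data:/data:Z --entrypoint vector docker://registry.redhat.io/openshift-logging/vector-rhel9@sha256:c76c853565fc93da88ba0116425839cfb86cab63cbde9856d7cca78278d7758a --config /etc/vector/vector.yaml -vvv 2>&1 | grep -c "Incremented acknowledged reader file ID"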

      Expected results:

      After an OOM kill and buffer corruption, the collector should recover and continue sending logs.

      Additional information:

      I tested with docker.io/timberio/vector:nightly-distroless-libc (Vector's latest nightly image), and this issue did not reproduce.
      I believe the fix in vector upstream PR#23617 resolves it.
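
      For comparison, a sketch of the same Step 9 run against the nightly image (assuming the distroless image already uses vector as its entrypoint, so --entrypoint is omitted):

        $ rm -rf data/buffer/v2/console
        $ cp -pR data/buffer/v2/console.corrupt data/buffer/v2/console
        $ timeout 15 podman run --pull always --rm -v `pwd`/conf:/etc/vector:Z -v `pwd`/data:/data:Z docker.io/timberio/vector:nightly-distroless-libc --config /etc/vector/vector.yaml -vvv | cut -c1-180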

              Assignee: Unassigned
              Reporter: KATSUYA KAWAKAMI (kkawakam@redhat.com)