OpenShift Logging / LOG-7873

[Logging 6.2][Vector] Log forwarding stops after OOM; disk buffer file corruption causes deserialization error

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version: Logging 6.2.z
    • Component: Log Collection
    • Incidents & Support
    • Release Note Type: Bug Fix

      Description of problem:

      After the collector pod is OOM-killed and restarts, the pod reports Running, but no logs are sent.
      The collector logs the error "Last written record was unable to be deserialized", because the on-disk buffer file was corrupted by the abrupt termination.
      This blocks log forwarding indefinitely, which is a resilience gap in the OpenShift Logging system:
      the system looks healthy from the outside, but in fact no logs reach the destination.
      As a result, monitoring tools can miss important events, and troubleshooting becomes very hard.

      This issue is confirmed upstream in https://github.com/vectordotdev/vector/pull/23617.
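
      On a live cluster, one way to check whether a collector pod is stuck in this state is to search its logs for the deserialization error. A minimal sketch, assuming the openshift-logging namespace and the standard collector component label (both may differ per deployment):

      $ for pod in $(oc -n openshift-logging get pods -l app.kubernetes.io/component=collector -o name); do
          oc -n openshift-logging logs "$pod" | grep -q "Last written record was unable to be deserialized" \
            && echo "$pod appears stuck on a corrupted disk buffer"
        done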

      Version-Release number of selected component (if applicable):

      OpenShift 4.18.9
      OpenShift Logging 6.2.5

      How reproducible:

      Always

      Steps to Reproduce:

      We’ve created a simplified reproduction procedure that doesn’t require deploying an OpenShift cluster.
      It defines a vector configuration equivalent to CLO’s AtLeastOnce delivery option, which is the key point of the reproduction.
      The same behavior was observed on OpenShift as well, but this minimal setup reproduces the issue without the complexity of a full cluster.
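
      For context, the disk buffer in the config below corresponds to what CLO generates when an output uses AtLeastOnce delivery; on a real cluster you can confirm the mode on the ClusterLogForwarder (a sketch; the resource name is a placeholder):

      $ oc -n openshift-logging get clusterlogforwarder <name> -o yaml | grep deliveryMode
            deliveryMode: AtLeastOnce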

      • Step 1. Create two directories.
        $ mkdir -p conf data
        
      • Step 2. Create conf/vector.yaml with the following contents.
        data_dir: /data
        sources:
          events:
            type: file
            include:
              - /data/*.json
            ignore_checkpoints: true
            fingerprinting:
              strategy: device_and_inode
            max_line_bytes: 524288000
        sinks:
          console:
            type: console
            inputs:
              - split_events
            buffer:
              type: disk
              max_size: 268435488
              when_full: block
            encoding:
              codec: text
            target: stdout
        transforms:
          split_events:
            type: remap
            inputs:
              - events
            source: |-
              . = parse_json!(.message)
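
        Optionally, sanity-check the config with vector's built-in validator before running it (a sketch using the same image as the later steps; keep both volumes mounted so the data_dir and sources can be checked):

        $ podman run --rm -v `pwd`/conf:/etc/vector:Z -v `pwd`/data:/data:Z --entrypoint vector docker://registry.redhat.io/openshift-logging/vector-rhel9@sha256:c76c853565fc93da88ba0116425839cfb86cab63cbde9856d7cca78278d7758a validate /etc/vector/vector.yaml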
        
      • Step 3. Generate sample JSON files, each line carrying a ~1MB message.
        $ python3 << _EOF_
        import json
        import os
        
        # Directory to store the test files
        output_dir = "data" 
        os.makedirs(output_dir, exist_ok=True)
        
        # Create a single large event (~1MB)
        nested_json = {
            "user": "test_user",
            "action": "upload",
            "metadata": {
                "filename": "example.txt",
                "size": "1MB",
                "tags": ["test", "vector", "buffer"],
                "description": "A" * (1 * 1024 * 1024 - 200)
            }
        }
        event = {"message": json.dumps(nested_json)}
        
        # Serialize the event to a string
        event_str = json.dumps(event) + "\n" 
        event_size = len(event_str.encode("utf-8"))
        
        # Calculate how many events to reach ~50MB
        target_size = 50 * 1024 * 1024  # 50MB
        num_events = target_size // event_size
        
        # Create 3 files with repeated events
        for i in range(3):
            file_path = os.path.join(output_dir, f"test_events_{i+1}.json")
            with open(file_path, "w") as f:
                for _ in range(num_events):
                    f.write(event_str)
        
        print(f"Created 3 test files in {output_dir}, each approximately 50MB.")
        _EOF_
        
      • Step 4. Check that the 3 JSON files were created.
        $ ls -l data
        -rw-r--r--. 1 test test 52428800 Oct 14 14:16 test_events_1.json
        -rw-r--r--. 1 test test 52428800 Oct 14 14:16 test_events_2.json
        -rw-r--r--. 1 test test 52428800 Oct 14 14:16 test_events_3.json
        
      • Step 5. Pull Logging 6.2.5's vector image.
        $ podman pull registry.redhat.io/openshift-logging/vector-rhel9@sha256:c76c853565fc93da88ba0116425839cfb86cab63cbde9856d7cca78278d7758a
        
      • Step 6. Run vector once to generate its disk buffer files.
        $ rm -rf data/buffer/ data/org data/events/
        $ timeout 15 podman run --pull always --rm -v `pwd`/conf:/etc/vector:Z -v `pwd`/data:/data:Z --entrypoint vector docker://registry.redhat.io/openshift-logging/vector-rhel9@sha256:c76c853565fc93da88ba0116425839cfb86cab63cbde9856d7cca78278d7758a --config /etc/vector/vector.yaml -vvv | cut -c1-180
        ...
        2025-10-14T05:21:27.714740Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="split_events, console" time_remaining="59 seconds left" 
        2025-10-14T05:21:27.724998Z DEBUG transform{component_kind="transform" component_id=split_events component_type=remap}: vector::topology::builder: Synchronous transform finished normally.
        2025-10-14T05:21:27.731238Z DEBUG sink{component_kind="sink" component_id=console component_type=console}: vector::topology::builder: Sink finished normally.
        Terminated
        
      • Step 7. Check that buffer-data-*.dat exists.
        $ ls -l data/buffer/v2/console/
        total 110600
        -rw-r-----. 1 test test 113247760 Oct 14 14:21 buffer-data-1.dat
        -rw-r-----. 1 test test        24 Oct 14 14:21 buffer.db
        -rw-r--r--. 1 test test         0 Oct 14 14:21 buffer.lock
        
      • Step 8. Truncate buffer-data-*.dat to simulate buffer data corruption.
        $ cp -pR data/buffer/v2/console/ data/buffer/v2/console.corrupt
        $ truncate -s 113247000 data/buffer/v2/console.corrupt/buffer-data-1.dat
        $ ls -l data/buffer/v2/console.corrupt/
        total 110600
        -rw-r-----. 1 test test 113247000 Oct 14 14:22 buffer-data-1.dat
        -rw-r-----. 1 test test        24 Oct 14 14:21 buffer.db
        -rw-r--r--. 1 test test         0 Oct 14 14:21 buffer.lock
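
        For reference, this truncation cuts 760 bytes (113247760 - 113247000) off the tail of the data file, leaving the last written record incomplete, which is the same effect a partial write at OOM-kill time has. To confirm the difference:

        $ echo $(( $(stat -c %s data/buffer/v2/console/buffer-data-1.dat) - $(stat -c %s data/buffer/v2/console.corrupt/buffer-data-1.dat) ))
        760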
        
      • Step 9. Run the vector image against the corrupted buffer, simulating the state left behind by an OOM kill.
        Test with Logging 6.2.5's vector image:
        
        $ rm -rf data/buffer/v2/console
        $ cp -pR data/buffer/v2/console.corrupt data/buffer/v2/console
        $ timeout 15 podman run --pull always --rm -v `pwd`/conf:/etc/vector:Z -v `pwd`/data:/data:Z --entrypoint vector docker://registry.redhat.io/openshift-logging/vector-rhel9@sha256:c76c853565fc93da88ba0116425839cfb86cab63cbde9856d7cca78278d7758a --config /etc/vector/vector.yaml -vvv | cut -c1-180
        

      Actual results:

      • The collector prints errors and gets stuck. We see: "Last written record was unable to be deserialized."
      • It also logs “Incremented acknowledged reader file ID …” over and over in a loop.
      • Log forwarding stops:
        • the pod shows Running, but the destination log file does not change for 15+ minutes.
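
      To confirm the loop, you can count the repeated message in the Step 9 output; the count keeps growing while no events are delivered:

        $ timeout 60 podman run --rm -v `pwd`/conf:/etc/vector:Z -v `pwd`/data:/data:Z --entrypoint vector docker://registry.redhat.io/openshift-logging/vector-rhel9@sha256:c76c853565fc93da88ba0116425839cfb86cab63cbde9856d7cca78278d7758a --config /etc/vector/vector.yaml -vvv 2>&1 | grep -c "Incremented acknowledged reader file ID"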

      Expected results:

      After an OOM kill and buffer corruption, the collector should recover and continue sending logs.

      Additional information:

      I tested with docker.io/timberio/vector:nightly-distroless-libc (Vector's latest nightly image), and this issue did not reproduce.
      I believe the fix in vector upstream PR#23617 resolves it.
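
      For comparison, a sketch of the same Step 9 run against the nightly image (assuming the distroless image already uses vector as its entrypoint, so --entrypoint is omitted):

        $ rm -rf data/buffer/v2/console
        $ cp -pR data/buffer/v2/console.corrupt data/buffer/v2/console
        $ timeout 15 podman run --pull always --rm -v `pwd`/conf:/etc/vector:Z -v `pwd`/data:/data:Z docker.io/timberio/vector:nightly-distroless-libc --config /etc/vector/vector.yaml -vvv | cut -c1-180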

              Assignee: Unassigned
              Reporter: KATSUYA KAWAKAMI (kkawakam@redhat.com)