OpenShift Logging / LOG-4725

Too far behind errors in fluentd


    • Type: Bug
    • Resolution: Obsolete
    • Priority: Major
    • Affects Version: Logging 5.7.6
    • Component: Log Collection
    • Release Note Type: Bug Fix
    • Severity: Important

      Description of problem:

Error messages like the following are continually observed in fluentd:

      openshift-logging/logging-loki-ingester-1[loki-ingester]: level=warn ts=2023-10-04T09:17:18.665393472Z caller=grpc_logging.go:43 method=/logproto.Pusher/Push duration=2.273379ms err="rpc error: code = Code(400) desc = entry with timestamp 2023-10-04 07:36:24.87037 +0000 UTC ignored, reason: 'entry too far behind, oldest acceptable timestamp is: 2023-10-04T08:17:16Z' for stream: {fluentd_thread=\"flush_thread_1\", kubernetes_host=\"woker-1.example.com\", log_type=\"audit\"},\nuser 'audit', total ignored: 1 out of 122" msg=gRPC
      
      $ for pod in $(omc get pods -l component=collector -o name); do omc logs $pod -c collector ; done|grep -c "too far behind"
      18085
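
      To see whether the rejections are concentrated on a single node or log type, the same collector logs can be grouped by the stream labels quoted in the error. A minimal sketch reusing the loop above (the exact quoting of the labels inside the log lines is an assumption and may need adjusting):

      $ for pod in $(omc get pods -l component=collector -o name); do omc logs $pod -c collector; done \
          | grep "too far behind" \
          | grep -oE 'kubernetes_host=\\?"[^"]*\\?", log_type=\\?"[^"]*\\?"' \
          | sort | uniq -c | sort -rn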
       

      The following were checked and not observed:

       * problems on the node related to memory or CPU
       * problems with the collector hitting its resource limits (it has no CPU limits set)
       * buffer files in fluentd that would indicate fluentd is buffering logs on its side because of delays delivering them to Loki or any other configured output
       * old timestamps in the audit log files themselves; reviewing the files directly shows the entries are generated with the current timestamp
       * clock drift; all nodes are NTP-synchronized and use the same timezone and time (a sketch of commands to re-verify these checks follows this list)
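
      The node-level checks above can be re-verified with commands such as the following (they assume direct access to a live cluster rather than a must-gather; the node name and namespace are placeholders/assumptions):

      # Confirm the clock is synchronized on a node that appears in the rejected streams
      $ oc debug node/<node-name> -- chroot /host chronyc tracking

      # Compare collector CPU/memory usage with the requests/limits actually set
      $ oc adm top pods -n openshift-logging -l component=collector
      $ oc get pods -n openshift-logging -l component=collector \
          -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'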

      Since there are no buffer files on the buffer path towards Loki, it is not possible to inspect the buffered entries and see their real content.
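
      To confirm how the Loki output and its buffering are actually configured, the running fluent.conf and the on-disk buffer directory can be inspected directly. A minimal sketch; the ConfigMap name "collector", the /var/lib/fluentd buffer root and the pod name are assumptions for this release:

      # Dump the generated fluent.conf and look at the output/buffer sections
      $ oc -n openshift-logging get configmap collector -o jsonpath='{.data.fluent\.conf}' | grep -B 2 -A 15 '<buffer'

      # Look for on-disk buffer chunks inside one collector pod
      $ oc -n openshift-logging exec <collector-pod> -c collector -- ls -lR /var/lib/fluentd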

      Version-Release number of selected component (if applicable):

      CLO 5.7.6
      Fluentd
      Loki

      How reproducible:

      Not able to reproduce

      Steps to Reproduce:

      Actual results:

      Expected results:

      Additional info:

      • Guidance is needed to troubleshoot the cause of the "too far behind" errors in fluentd: so far no problems have been found at the node level, nor any CPU or memory constraints or buffer files on the fluentd side, as described above.
      • The audit pos files were deleted in one fluentd pod and the pod was restarted, on the theory that fluentd was re-reading old logs and sending their old timestamps to Loki, which then rejected them. The same error was still observed, and it also appears for infrastructure logs (see the pos-file sketch below).
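
      To check whether fluentd's tail has actually fallen behind (which would explain old timestamps being shipped on catch-up), the offsets recorded in the pos files can be compared with the current size of the watched files. A minimal sketch; the search paths, the audit log path and the availability of sh/find/stat in the collector image are assumptions:

      # Locate fluentd position files and dump the offsets they track
      $ oc -n openshift-logging exec <collector-pod> -c collector -- \
          sh -c 'for f in $(find /var/lib/fluentd /var/log -name "*.pos" 2>/dev/null); do echo "== $f"; cat "$f"; done'

      # Each pos line is "<watched file> <offset in hex> <inode in hex>"; compare the offset
      # with the current size of the watched file, e.g. for the node audit log:
      $ oc -n openshift-logging exec <collector-pod> -c collector -- stat -c '%n %s bytes' /var/log/audit/audit.log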

              Assignee: Unassigned
              Reporter: Oscar Casal Sanchez (rhn-support-ocasalsa)