OpenShift Logging - LOG-1586

fluentd cannot sync or clean up the buffer when it is over capacity


    • Type: Bug
    • Resolution: Won't Do
    • Priority: Major
    • Affects Version/s: Logging 5.0.5
    • Component/s: Log Collection
    • Sprint: Logging (Core) - Sprint 209

      We have one cluster where fluentd buffer files are filling up the node disk.

      At first, new pods could not be scheduled to the node because disk usage was over 85%, so the scheduler would not allow pods to be created there. The fluentd pod could not run on the node either.
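
      One way to confirm the scheduler-side symptom is to check whether the kubelet has reported DiskPressure and tainted the node. This is a generic check rather than output from the cluster above; <node-name> is a placeholder for the affected node:

      # DiskPressure condition as reported by the kubelet ("True" while under pressure)
      oc get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
      # the node.kubernetes.io/disk-pressure:NoSchedule taint is what blocks new pods
      oc describe node <node-name> | grep -i taints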

      sh-4.4# df -h | grep nvme
      /dev/nvme0n1p4 350G 298G 52G 86% /host
      /dev/nvme0n1p3 364M 190M 151M 56% /host/boot
      sh-4.4# du -sh *
      3.0G default
      277G retry_default
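
      The retry_default directory above is the file buffer for the Elasticsearch output's retry queue, so the configured buffer limits matter here. A quick way to see what limits the deployed config sets, assuming the configmap name "fluentd" and key "fluent.conf" that cluster-logging 5.x deploys (verify against the attached fluentd-configmap.txt):

      oc -n openshift-logging get configmap fluentd \
        -o jsonpath='{.data.fluent\.conf}' | grep -B 2 -A 8 '<buffer'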

      We then manually deleted some of the buffer files to free up disk space.

      sh-4.4# cd /host/sysroot/ostree/deploy/rhcos/var/lib/fluentd
      sh-4.4# du -sh *
      8.5G default
      84G retry_default
      sh-4.4# df -h .
      Filesystem Size Used Avail Use% Mounted on
      /dev/nvme0n1p4 350G 111G 239G 32% /host/sysroot
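
      If chunks ever have to be removed by hand again, a safer sketch is to make sure fluentd is not writing while files disappear underneath it; all names below are placeholders, and note that the daemonset will recreate the deleted pod, so the window is short:

      # take the collector pod off the node, then work from a node debug shell
      oc -n openshift-logging delete pod <fluentd-pod-on-node>
      oc debug node/<node-name>
      # inside the debug shell:
      chroot /host
      cd /var/lib/fluentd/retry_default
      # ls -t lists newest first; keep the newest 1000 entries, remove the rest
      # (each chunk is a buffer.*.log file plus a matching .log.meta sidecar)
      ls -t | tail -n +1001 | xargs -r rm --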

      After that, fluentd was able to run, but the pod log was full of errors: it could neither sync logs to Elasticsearch nor clean up the old buffers.

      2021-07-21 05:43:12 +0000 [warn]: suppressed same stacktrace
      2021-07-21 05:44:35 +0000 [warn]: [retry_default] failed to flush the buffer. retry_time=73 next_retry_seconds=2021-07-21 05:45:41 +0000 chunk="5c66aed6802f04e974fe229e5376a31c" error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."
       2021-07-21 05:44:35 +0000 [warn]: suppressed same stacktrace
      2021-07-21 05:44:35 +0000 [warn]: [retry_default] failed to flush the buffer. retry_time=74 next_retry_seconds=2021-07-21 05:45:37 +0000 chunk="5c66ae46a917f94bf59e60b371bc0246" error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."
       2021-07-21 05:44:35 +0000 [warn]: suppressed same stacktrace
      2021-07-21 05:45:55 +0000 [warn]: [retry_default] failed to flush the buffer. retry_time=75 next_retry_seconds=2021-07-21 05:46:51 +0000 chunk="5c66ae46a917f94bf59e60b371bc0246" error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."
       2021-07-21 05:45:55 +0000 [warn]: suppressed same stacktrace
      2021-07-21 05:45:56 +0000 [warn]: [retry_default] failed to flush the buffer. retry_time=76 next_retry_seconds=2021-07-21 05:46:56 +0000 chunk="5c66aed6802f04e974fe229e5376a31c" error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."
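
      The error_class above comes from the retry stream, which suggests the retry_default buffer is still at its configured total_limit_size even though disk space was freed. A sketch for comparing live usage against the limit from inside the collector pod; the component=fluentd label and the /etc/fluent path are assumptions to verify for this release:

      POD=$(oc -n openshift-logging get pods -l component=fluentd -o name | head -1)
      oc -n openshift-logging exec "$POD" -- du -sh /var/lib/fluentd/default /var/lib/fluentd/retry_default
      oc -n openshift-logging exec "$POD" -- grep -r total_limit_size /etc/fluent/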


      Version-Release number of selected component (if applicable):
      openshift v4.7.19
      cluster-logging 5.0.5-11

      How reproducible:
      Not sure; observed on one cluster so far.

      Steps to Reproduce:
      1. As described above.

      Actual results:
      The buffer files cannot be synced or cleaned up.

      Expected results:
      The fluentd service should be able to handle the buffers if they grow beyond capacity.
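
      As a possible mitigation until then, cluster-logging exposes fluentd buffer tuning through the ClusterLogging CR; a sketch that caps the buffer and tells fluentd to drop the oldest chunk instead of erroring when full (the field names and values follow the 5.x forwarder tuning API and should be checked against the 5.0 docs before use):

      oc -n openshift-logging patch clusterlogging instance --type merge \
        -p '{"spec":{"forwarder":{"fluentd":{"buffer":{"totalLimitSize":"8G","overflowAction":"drop_oldest_chunk"}}}}}'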

      Additional info:

        1. fluentd-ds.yaml (18 kB, Bo Meng)
        2. fluentd-configmap.txt (20 kB, Bo Meng)
        3. fluentd_buffer.txt (638 kB, Bo Meng)

              Assignee: Jeffrey Cantrill (jcantril@redhat.com)
              Reporter: Bo Meng (bmeng_sre.openshift)
              Votes: 0
              Watchers: 2
