-
Bug
-
Resolution: Won't Do
-
Major
-
None
-
Logging 5.0.5
-
False
-
False
-
NEW
-
NEW
-
Undefined
-
-
Logging (Core) - Sprint 209
We have one cluster where fluentd buffer files are filling up the node disk.
Initially, new pods could not be scheduled to the node because disk usage was over 85%, so the scheduler refused to place them. The fluentd pod could not run there either.
sh-4.4# df -h | grep nvme
/dev/nvme0n1p4  350G  298G   52G  86% /host
/dev/nvme0n1p3  364M  190M  151M  56% /host/boot
sh-4.4# du -sh *
3.0G    default
277G    retry_default
We then removed some of the buffer files manually to free up disk space.
sh-4.4# cd /host/sysroot/ostree/deploy/rhcos/var/lib/fluentd
sh-4.4# du -sh *
8.5G    default
84G     retry_default
sh-4.4# df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p4  350G  111G  239G  32% /host/sysroot
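For reference, the cleanup was presumably done from a node debug shell; a minimal sketch is below, assuming /var/lib/fluentd is the host path behind the /host/sysroot/... directory above. The node name and the one-day retention window are placeholders, not values taken from this report.

oc debug node/<node-name>
chroot /host
# same data as /host/sysroot/ostree/deploy/rhcos/var/lib/fluentd seen above
du -sh /var/lib/fluentd/*
# prune retry chunks older than one day to free disk space
find /var/lib/fluentd/retry_default -type f -name 'buffer*' -mtime +1 -delete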
After that, fluentd was able to run, but the pod log was full of errors; it could not flush logs to Elasticsearch and also could not clean up the old buffers.
2021-07-21 05:43:12 +0000 [warn]: suppressed same stacktrace
2021-07-21 05:44:35 +0000 [warn]: [retry_default] failed to flush the buffer. retry_time=73 next_retry_seconds=2021-07-21 05:45:41 +0000 chunk="5c66aed6802f04e974fe229e5376a31c" error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."
2021-07-21 05:44:35 +0000 [warn]: suppressed same stacktrace
2021-07-21 05:44:35 +0000 [warn]: [retry_default] failed to flush the buffer. retry_time=74 next_retry_seconds=2021-07-21 05:45:37 +0000 chunk="5c66ae46a917f94bf59e60b371bc0246" error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."
2021-07-21 05:44:35 +0000 [warn]: suppressed same stacktrace
2021-07-21 05:45:55 +0000 [warn]: [retry_default] failed to flush the buffer. retry_time=75 next_retry_seconds=2021-07-21 05:46:51 +0000 chunk="5c66ae46a917f94bf59e60b371bc0246" error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."
2021-07-21 05:45:55 +0000 [warn]: suppressed same stacktrace
2021-07-21 05:45:56 +0000 [warn]: [retry_default] failed to flush the buffer. retry_time=76 next_retry_seconds=2021-07-21 05:46:56 +0000 chunk="5c66aed6802f04e974fe229e5376a31c" error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."
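The "buffer is full." error means the elasticsearch output has hit its configured buffer limit, so new and retried chunks can no longer be queued. A minimal sketch for inspecting the configured limit and the backlog from the collector pod is below; the label selector, namespace, and config path are assumptions and may differ in this release.

# Assumptions: collector pods carry the label component=fluentd and mount the
# rendered config at /etc/fluent/fluent.conf -- verify before use.
POD=$(oc -n openshift-logging get pods -l component=fluentd -o name | head -1)
# show the <buffer> stanza; total_limit_size and overflow_action govern "buffer is full."
oc -n openshift-logging exec "$POD" -- grep -A12 '<buffer' /etc/fluent/fluent.conf
# count queued chunk files for the retry_default buffer
oc -n openshift-logging exec "$POD" -- sh -c 'ls /var/lib/fluentd/retry_default | wc -l'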
Version-Release number of selected component (if applicable):
openshift v4.7.19
cluster-logging 5.0.5-11
How reproducible:
Not sure.
Steps to Reproduce:
1. See the description above.
Actual results:
The buffer files cannot be flushed to Elasticsearch or cleaned up.
Expected results:
The fluentd service should be able to handle the buffers when they grow beyond the configured capacity (see the illustrative buffer stanza under Additional info below).
Additional info:
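A possible mitigation while this is unresolved would be to cap the on-disk file buffer in the elasticsearch output so it cannot fill the node disk. The stanza below is only an illustrative fluentd <buffer> section, not the configuration shipped with cluster-logging 5.0.5; the sizes and the overflow_action value are assumptions to be tuned per cluster.

# inside the elasticsearch <match> block of fluent.conf (illustrative values)
<buffer>
  @type file
  path /var/lib/fluentd/default
  total_limit_size 8g        # hard cap on disk space this buffer may use
  chunk_limit_size 8m
  overflow_action block      # apply back-pressure instead of raising "buffer is full."
  flush_thread_count 2
  retry_max_interval 300s
</buffer>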