Bug
Resolution: Won't Do
Major
None
Logging 5.8.z
False
None
False
NEW
NEW
Important
Description of problem:
In a context where both fluentd and vector appear to be missing a control loop to release deleted files, the OpenShift Logging Collector is seen to recurrently (up to 2-3 occurrences per month) destabilize control plane nodes, to the point where the Collector accounts for more than 50% of the overall space utilization on a control plane node.

[0] Root filesystem usage on the affected control plane node:

[core@$hostname ~]$ df -h /sysroot
Filesystem                            Size  Used Avail Use% Mounted on
/dev/mapper/coreos-luks-root-nocrypt  120G   94G   26G  79% /sysroot

[1] Total size of the deleted audit log files still held open:

$ cat 0150-lsof.txt | (head -1 && grep audit.*deleted) | awk '{print $(NF-3), $(NF-1)}' | sort | uniq | awk '{ sum+=$1} END {print sum}' | numfmt --field=1 --to=iec
65G

[2] Count of deleted-but-still-open files, per PID:

$ cat 0150-lsof.txt | grep "(deleted)" | awk '{print $2}' | sort | uniq -c | sort -rn
  19488 2754024
      1 1378
      1 1265
      1 1259

[3] Sample of the deleted audit log files still held open by fluentd (PID 2754024):

$ cat 0150-lsof.txt | (head -1 && grep audit.*deleted | tail -10)
COMMAND PID TID TASKCMD USER FD TYPE DEVICE SIZE/OFF NODE NAME
fluentd 2754024 2767598 utils.rb: root 2216r REG 253,0 99458001 37776091 /var/log/kube-apiserver/audit-2024-03-04T05-27-26.713.log (deleted)
fluentd 2754024 2767598 utils.rb: root 2295r REG 253,0 104782793 37782886 /var/log/kube-apiserver/audit-2024-03-04T12-30-00.272.log (deleted)
fluentd 2754024 2767598 utils.rb: root 2314r REG 253,0 103217901 37785868 /var/log/kube-apiserver/audit-2024-03-04T13-30-30.791.log (deleted)
fluentd 2754024 2767598 utils.rb: root 2315r REG 253,0 104713321 37785876 /var/log/kube-apiserver/audit-2024-03-04T14-09-51.258.log (deleted)
fluentd 2754024 2767598 utils.rb: root 2318r REG 253,0 98998625 37776086 /var/log/kube-apiserver/audit-2024-03-04T04-00-17.847.log (deleted)
fluentd 2754024 2767598 utils.rb: root 2320r REG 253,0 104554304 37782871 /var/log/kube-apiserver/audit-2024-03-04T12-00-21.663.log (deleted)
fluentd 2754024 2767598 utils.rb: root 2323r REG 253,0 104422225 37782868 /var/log/kube-apiserver/audit-2024-03-04T08-40-21.271.log (deleted)
fluentd 2754024 2767598 utils.rb: root 2331r REG 253,0 99131351 37776074 /var/log/kube-apiserver/audit-2024-03-04T03-23-32.920.log (deleted)
fluentd 2754024 2767598 utils.rb: root 2345r REG 253,0 104299871 37785874 /var/log/kube-apiserver/audit-2024-03-04T13-31-27.185.log (deleted)
fluentd 2754024 2767598 utils.rb: root 2359r REG 253,0 104474582 37785863 /var/log/kube-apiserver/audit-2024-03-04T12-49-41.961.log (deleted)
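For convenience, the pipelines in [1] and [2] above can be combined into a single per-PID summary of how much space is held by deleted-but-still-open files. This is only a sketch: it assumes the same lsof output layout shown above (per-thread TID rows and a trailing "(deleted)" marker), and 0150-lsof.txt stands in for whatever lsof capture was taken on the node.

# Per-PID total size of deleted-but-still-open files; the sort -u step
# deduplicates the per-thread (TID) rows that refer to the same file.
$ grep '(deleted)' 0150-lsof.txt \
    | awk '{ print $2, $(NF-3), $(NF-1) }' \
    | sort -u \
    | awk '{ sum[$1] += $2 } END { for (pid in sum) print pid, sum[pid] }' \
    | numfmt --field=2 --to=iec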
Version-Release number of selected component (if applicable):
OpenShift Logging 5.8.0 and Fluentd 1.16.2
Actual results:
The Collector shows consistent, linear growth in space utilization, primarily due to files that have already been deleted but are never released.
Expected results:
The desirable outcomes for this bug would be the following two:
(1) For the fluentd and vector collectors to have `rotate_wait` implemented for the kube-apiserver audit logs, as both the fluentd [0] and vector [1] generated configurations seem to be missing it today (see the verification sketch after the references below).
(2) For the collector (i.e. either fluentd or vector) to implement a control loop that releases file handles held on already-deleted files, along the lines of fluent-bit's in_tail handling [2].

[0] https://github.com/openshift/cluster-logging-operator/blob/master/internal/generator/fluentd/source/audit_logs.go#L63-L83
[1] https://github.com/openshift/cluster-logging-operator/blob/master/internal/generator/vector/conf/complex.toml#L45-L49
[2] https://github.com/fluent/fluent-bit/blame/master/plugins/in_tail/tail_file.c#L1773
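As a quick way to confirm the gap described in (1), one can check whether the rendered collector configuration contains any rotate_wait setting at all. This is only a sketch: the pod label selector and the in-container config paths (/etc/fluent/fluent.conf for fluentd, /etc/vector/vector.toml for vector) are assumptions that may differ between Logging releases, and <collector-pod> is a placeholder for a pod name from the first command.

# List the collector pods (label selector is an assumption):
$ oc -n openshift-logging get pods -l component=collector -o name

# fluentd-based collector: look for rotate_wait in the generated in_tail sources
$ oc -n openshift-logging exec <collector-pod> -- grep -n -B5 rotate_wait /etc/fluent/fluent.conf

# vector-based collector: look for rotate_wait_ms in the generated file sources
$ oc -n openshift-logging exec <collector-pod> -- grep -n -B5 rotate_wait /etc/vector/vector.toml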
Additional info:
While [0] and [1] may appear to have *fixed* this issue:
- The changes brought in [0] only re-base on Fluentd 1.16.2, which brings limited in_tail improvements [2].
- [1] does implement rotate_wait_ms, but **only** for the kubernetes_logs source [3].

Based on the above, these two KBs [4] [5] should be updated as well with the above-highlighted information.

[0] https://issues.redhat.com/browse/LOG-3949
[1] https://issues.redhat.com/browse/LOG-4241
[2] https://github.com/fluent/fluentd/releases/tag/v1.16.2
[3] https://github.com/openshift/cluster-logging-operator/pull/2185/commits/f528d592a035aac5c579d83bfd2acafd6bd6c493#diff-fdcafccbe68df65cd72c64a6f2aa9ee3935826cdbcc8ee1718cf77bbf43c5582R32
[4] https://access.redhat.com/solutions/7019864
[5] https://access.redhat.com/solutions/7007644