OpenShift Logging / LOG-5169

collector keeps deleted file open, leading to control plane disk pressure


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None
    • Logging 5.8.z
    • Log Collection
    • False
    • None
    • False
    • NEW
    • NEW
    • Important

      Description of problem:

      Both fluentd and vector seem to be missing a control loop that releases deleted files. As a result, the OpenShift Logging collector recurrently (2-3 occurrences per month) destabilizes control plane nodes, and in these situations the collector accounts for more than 50% of the overall disk space utilization on a control plane node.
      
      [0]
      [core@$hostname ~]$ df -h /sysroot
      Filesystem                            Size  Used Avail Use% Mounted on
      /dev/mapper/coreos-luks-root-nocrypt  120G   94G   26G  79% /sysroot
      
      [1]
      $ cat 0150-lsof.txt  | (head -1 && grep audit.*deleted) | awk '{print $(NF-3), $(NF-1)}' | sort | uniq  | awk '{ sum+=$1} END {print sum}' | numfmt --field=1 --to=iec
      65G
      
      [2]
      $ cat 0150-lsof.txt  | grep "(deleted)" | awk '{print $2}' | sort | uniq -c | sort -rn
        19488 2754024
            1 1378
            1 1265
            1 1259
      
      [3]
      $ cat 0150-lsof.txt  | (head -1 && grep audit.*deleted | tail -10)
      COMMAND       PID     TID TASKCMD       USER   FD      TYPE             DEVICE   SIZE/OFF       NODE NAME
      fluentd   2754024 2767598 utils.rb:     root 2216r      REG              253,0   99458001   37776091 /var/log/kube-apiserver/audit-2024-03-04T05-27-26.713.log (deleted)
      fluentd   2754024 2767598 utils.rb:     root 2295r      REG              253,0  104782793   37782886 /var/log/kube-apiserver/audit-2024-03-04T12-30-00.272.log (deleted)
      fluentd   2754024 2767598 utils.rb:     root 2314r      REG              253,0  103217901   37785868 /var/log/kube-apiserver/audit-2024-03-04T13-30-30.791.log (deleted)
      fluentd   2754024 2767598 utils.rb:     root 2315r      REG              253,0  104713321   37785876 /var/log/kube-apiserver/audit-2024-03-04T14-09-51.258.log (deleted)
      fluentd   2754024 2767598 utils.rb:     root 2318r      REG              253,0   98998625   37776086 /var/log/kube-apiserver/audit-2024-03-04T04-00-17.847.log (deleted)
      fluentd   2754024 2767598 utils.rb:     root 2320r      REG              253,0  104554304   37782871 /var/log/kube-apiserver/audit-2024-03-04T12-00-21.663.log (deleted)
      fluentd   2754024 2767598 utils.rb:     root 2323r      REG              253,0  104422225   37782868 /var/log/kube-apiserver/audit-2024-03-04T08-40-21.271.log (deleted)
      fluentd   2754024 2767598 utils.rb:     root 2331r      REG              253,0   99131351   37776074 /var/log/kube-apiserver/audit-2024-03-04T03-23-32.920.log (deleted)
      fluentd   2754024 2767598 utils.rb:     root 2345r      REG              253,0  104299871   37785874 /var/log/kube-apiserver/audit-2024-03-04T13-31-27.185.log (deleted)
      fluentd   2754024 2767598 utils.rb:     root 2359r      REG              253,0  104474582   37785863 /var/log/kube-apiserver/audit-2024-03-04T12-49-41.961.log (deleted)
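
The pipelines in [1] and [2] can be combined into one pass that reports the space held by deleted files per process. This is a minimal sketch, not part of the original report: the function name `deleted_bytes_by_pid` is hypothetical, it assumes `lsof` output in the column layout shown in [3], and it assumes file names without spaces (fields are counted from the end of the line). Because `lsof` lists the same file once per thread, each (PID, inode) pair is counted only once.

```shell
#!/bin/sh
# Sum the size of deleted-but-open files per PID from a saved lsof listing.
# Assumes lines ending in "NAME (deleted)", with SIZE/OFF and NODE just before NAME.
deleted_bytes_by_pid() {
  awk '$NF == "(deleted)" {
         pid = $2
         size = $(NF-3)    # SIZE/OFF column
         inode = $(NF-2)   # NODE column
         key = pid SUBSEP inode
         # lsof repeats the same file for every thread; count each inode once per PID
         if (!(key in seen)) { seen[key] = 1; total[pid] += size }
       }
       END { for (p in total) printf "%s %.0f\n", p, total[p] }' "$1" |
    sort -k2,2nr
}
```

It can be run against the same capture, for example `deleted_bytes_by_pid 0150-lsof.txt | numfmt --field=2 --to=iec`, to print one line per PID with the largest holder first.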

      Version-Release number of selected component (if applicable):

      OpenShift Logging 5.8.0 and Fluentd 1.16.2

      Actual results:

      The collector causes steady, linear growth in disk space utilization, primarily due to files that have already been deleted but not released.

      Expected results:

      The desirable outcomes for this bug would be the following two:
      (1) For fluentd & vector collectors to have `rotate_wait` implemented for kube-apiserver audit logs, as both fluentd [0] and vector [1] seem to be missing it today.
      
      (2) For the collector (i.e. either fluentd or vector) to implement a control loop where it releases deleted files.
      
      [0] https://github.com/openshift/cluster-logging-operator/blob/master/internal/generator/fluentd/source/audit_logs.go#L63-L83
      [1] https://github.com/openshift/cluster-logging-operator/blob/master/internal/generator/vector/conf/complex.toml#L45-L49
      [2] https://github.com/fluent/fluent-bit/blame/master/plugins/in_tail/tail_file.c#L1773
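
The condition that a release loop as requested in (2) would act on can be observed from outside the collector on a Linux node. The following is a rough sketch under stated assumptions, not how fluentd or vector track files internally: the function name `deleted_fd_status` and the `drained`/`pending` labels are hypothetical, and it assumes the collector reads sequentially so that the kernel file offset reflects how much it has consumed. For each descriptor of a process that points at a deleted file, it compares the file size with the read offset from `/proc/PID/fdinfo`.

```shell
#!/bin/sh
# For one PID, list descriptors that point at deleted files and report
# whether the reader has reached the end of each file.
# Output columns: fd size offset state target
deleted_fd_status() {
  pid="$1"
  for link in /proc/"$pid"/fd/*; do
    target=$(readlink "$link" 2>/dev/null) || continue
    case "$target" in
      *" (deleted)") ;;   # the kernel appends this suffix to unlinked files
      *) continue ;;
    esac
    fd=${link##*/}
    # stat -L follows the /proc link to the still-open (unlinked) file
    size=$(stat -L -c %s "$link" 2>/dev/null) || continue
    pos=$(awk '$1 == "pos:" { print $2 }' /proc/"$pid"/fdinfo/"$fd" 2>/dev/null)
    if [ "$pos" = "$size" ]; then state=drained; else state=pending; fi
    printf '%s %s %s %s %s\n' "$fd" "$size" "$pos" "$state" "$target"
  done
}
```

Run as root against the collector PID, for example `deleted_fd_status 2754024`. Under the assumption above, a descriptor reported as `drained` is one that a release loop could close without losing data, while `pending` entries still have unread bytes.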

      Additional info:

      While [0] and [1] appear to have *fixed* this issue:
      - The changes brought in by [0] only seem to rebase on Fluentd 1.16.2, which brings limited in_tail improvements [2]
      - [1] seems to implement rotate_wait_ms, but **only** for kubernetes_logs [3]
      
      Based on the above, these two [4][5] KBs should be updated as well, with the above-highlighted information.
      
      [0] https://issues.redhat.com/browse/LOG-3949
      [1] https://issues.redhat.com/browse/LOG-4241
      [2] https://github.com/fluent/fluentd/releases/tag/v1.16.2
      [3] https://github.com/openshift/cluster-logging-operator/pull/2185/commits/f528d592a035aac5c579d83bfd2acafd6bd6c493#diff-fdcafccbe68df65cd72c64a6f2aa9ee3935826cdbcc8ee1718cf77bbf43c5582R32
      [4] https://access.redhat.com/solutions/7019864
      [5] https://access.redhat.com/solutions/7007644

            Unassigned Unassigned
            rhn-support-rsandu Robert Sandu
            Anping Li Anping Li