OpenShift Logging / LOG-4221

[release-5.7] Fluentd wrongly closes a log file due to hash collision


    • Before this change, Fluentd could incorrectly close files that it should still be watching because of the way watched files were tracked. This change fixes that behavior so that it no longer relies strictly on the full path when determining which files it is watching.
    • Bug Fix

      Description of problem:

      Fluentd manages its watch list with hash values generated from the path of each log file, but those hash values are no longer unique once log rotation happens: the rotated copy and the live file can collide on the same key, so detaching the rotated copy can also close the watcher for the still-active file.

      When follow_inodes is true, the hash values must be generated from inode numbers instead.
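
      For context, both sides of this bookkeeping can be inspected from inside the collector (a minimal sketch, assuming $COLLECTOR is set as in Step 4 below; pos-file columns are path, read offset in hex, and inode in hex):

      $ oc exec -n openshift-logging $COLLECTOR -c collector -- \
         grep "/node-ca/0.log" /var/lib/fluentd/pos/es-containers.log.pos
      $ oc exec -n openshift-logging $COLLECTOR -c collector -- \
         bash -c 'ls -li /var/log/pods/openshift-image-registry_node-ca-*/node-ca/0.log'

      After a rotation, the live file keeps the path 0.log but gets a new inode, while the rotated copy keeps the old inode under a timestamped name, so a key derived only from the path cannot tell the two apart.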

      Version-Release number of selected component (if applicable):

      OpenShift v4.12.0 with Red Hat OpenShift Logging v5.6.0

      How reproducible:

      It happens intermittently, when rotated files are closed after the rotate_wait interval.
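
      To double-check the in_tail settings the collector is actually running with, the generated fluentd config can be grepped (a hedged check: /etc/fluent/fluent.conf is the usual location in the collector image, and rotate_wait may not appear at all if the default value is used):

      $ oc exec -n openshift-logging $COLLECTOR -c collector -- \
         grep -E "follow_inodes|rotate_wait" /etc/fluent/fluent.conf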

      Steps to Reproduce:

      Step 1. Install OpenShift Logging and deploy the EFK stack on your cluster.

      Step 2. Pick one of the cluster nodes to use for this test and store its name in the NODE variable.

      Step 3. Recreate the collector pods and node-ca pods so the test starts from a fresh state:

      $ oc delete pods -n openshift-image-registry -l name=node-ca
      $ oc delete pods -n openshift-logging -l component=collector
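
      Before continuing, it can help to confirm that the replacement pods came back on the chosen node (a quick check, assuming NODE holds the node name from Step 2):

      $ oc get pods -n openshift-image-registry -o wide | grep -w "$NODE" | grep node-ca
      $ oc get pods -n openshift-logging -o wide | grep -w "$NODE" | grep collector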
      

      Step 4. Run the following script in one terminal. It watches the collector's pos file, the log directory listing, and fluentd's open file descriptors for the node-ca container, and exits once fluentd no longer holds the live log file open:

      $ COLLECTOR=$(oc get pods -n openshift-logging -o wide | grep -w $NODE | awk '{print $1}')
      $ POD=$(oc get pods -n openshift-image-registry -o wide | grep node-ca | grep -w $NODE | awk '{print $1}')
      $ NAMESPACE=openshift-image-registry
      $ CONTAINER=node-ca
      
      $ oc exec -n openshift-logging $COLLECTOR -c collector -- \
         bash -c 'pre_inode=0;
                  while :;
                  do
                    date;
                    sleep 5 &
                    SLEEP_PID=$!
                    echo "-- Pos file";
                    grep -E "/var/log/pods/'"$NAMESPACE"'_'"$POD"'_.*/'"$CONTAINER"'/" /var/lib/fluentd/pos/es-containers.log.pos \
                      | awk "{printf(\"%s\t%s\t%s\n\", \$1, strtonum(\"0x\"\$2), strtonum(\"0x\"\$3))}";
                    echo "-- Ls";
                    ls -lditr /var/log/pods/'"$NAMESPACE"'_'"$POD"'_*/'"$CONTAINER"'/*;
                    echo "-- Lsof";
                    lsof -p 1 -nP | grep "/var/log/pods/'"$NAMESPACE"'_'"$POD"'_.*/'"$CONTAINER"'/" > /tmp/lsof.txt
                    if [ $(cat /tmp/lsof.txt | wc -l ) -eq 0 ]
                    then
                       echo "The issue got reproduced." 
                       kill $SLEEP_PID
                       break
                    fi
                    cat /tmp/lsof.txt
                    read inode path < <(ls -itrd /var/log/pods/'"$NAMESPACE"'_'"$POD"'_*/'"$CONTAINER"'/*.log.????????-?????? | tail -n 1);
                    if [ "$inode" != "" ] && [ "$inode" != "$pre_inode" ]
                    then
                        pre_inode="$inode";
                        echo "--- New log rotate detected: $path";
                    fi;
                    echo; wait
      done' | tee pos_watch_${COLLECTOR}.txt && pkill -f "$POD"
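
      For reference, the pos-file decode used inside the loop can also be run on its own; it prints the hex offset and inode columns as decimal using the same gawk strtonum conversion (a convenience sketch, not required for the reproduction):

      $ oc exec -n openshift-logging $COLLECTOR -c collector -- \
         awk '{printf("%s\toffset=%d\tinode=%d\n", $1, strtonum("0x"$2), strtonum("0x"$3))}' \
         /var/lib/fluentd/pos/es-containers.log.pos | grep node-ca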
      

      Step 5. Run the following script in another terminal. It makes the node-ca container emit large bursts of log lines so that its log file is rotated frequently:

      oc exec -it -n openshift-image-registry $POD -- bash -c '
        count=0;
        while :;
        do
          DATE="$(date)";
          for i in 1 2 3 4;
          do
            printf "$DATE: %3900d\n$DATE: %3900d\n$DATE: %3900d\n$DATE: %3900d\n$DATE: %3900d\n$DATE: %3900d\n" \
                   "$count" "$count" "$count" "$count" "$count" "$count" > /proc/1/fd/2 &
          done;
          ((count=count+1));
          echo "$count times" 
          if [ $(expr $count % 2000) -eq 0 ];
          then
            echo "sleep 60s...";
            sleep 60 & wait;
          fi;
          wait;
        done'; echo "Kill remaining processes"; oc exec -it -n openshift-image-registry $POD -- pkill -f DATE
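
      While the load generator above is running, rotation can be confirmed from the collector side with the same listing the watcher script uses; a new timestamped 0.log.<date>-<time> file appearing means a rotation has just happened:

      $ oc exec -n openshift-logging $COLLECTOR -c collector -- \
         bash -c 'ls -litr /var/log/pods/openshift-image-registry_node-ca-*/node-ca/*'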
      

      Step 6. Wait until the issue is reproduced (it might take 10 to 20 minutes).

      Actual results:

      The first script exits with the "The issue got reproduced." message.

      Mon Feb 13 10:08:34 UTC 2023
      -- Pos file
      /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log 18446744073709551616    195064732
      /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log 18446744073709551616    195064733
      /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log 58444023        195064732
      /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log 35228450        195064763
      -- Ls
      195064725 -rw-r--r--. 1 root root   231685 Feb 13 10:06 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100423.gz
      195064724 -rw-r--r--. 1 root root   216644 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100625.gz
      195064729 -rw-r--r--. 1 root root   225105 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100716.gz
      195064732 -rw-------. 1 root root 58444023 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100758
      195064763 -rw-------. 1 root root 39091784 Feb 13 10:08 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log
      -- Lsof
      The issue got reproduced.
      

      After that, you can confirm that fluentd has already closed /var/log/pods/.../node-ca/0.log even though the log file still exists:

      $ oc exec -it -n openshift-logging $COLLECTOR -c collector -- bash -c "ls -li /var/log/pods/openshift-image-registry_node-ca-*/node-ca/*" 
      195064763 -rw-------. 1 root root 39092337 Feb 13 10:08 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log
      195064725 -rw-r--r--. 1 root root   231685 Feb 13 10:06 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100423.gz
      195064724 -rw-r--r--. 1 root root   216644 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100625.gz
      195064729 -rw-r--r--. 1 root root   225105 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100716.gz
      195064732 -rw-------. 1 root root 58444023 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100758
      
      $ oc exec -it -n openshift-logging $COLLECTOR -c collector -- lsof -nP -p 1 | grep node-ca || echo "Not Found" 
      Not Found
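
      The two checks above can be combined into a single hedged one-liner that prints a message only when the bad state is present, i.e. 0.log exists on disk but fluentd holds no open file descriptor for any node-ca log:

      $ oc exec -n openshift-logging $COLLECTOR -c collector -- \
         bash -c 'ls /var/log/pods/openshift-image-registry_node-ca-*/node-ca/0.log > /dev/null &&
                  ! lsof -nP -p 1 | grep -q node-ca &&
                  echo "0.log exists but fluentd is not reading it"'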
      

      Expected results:

      Fluentd should not close /var/log/pods/.../node-ca/0.log while the file still exists.

              jcantril@redhat.com Jeffrey Cantrill
              rhn-support-kbharti Kabir Bharti