OpenShift Logging / LOG-4221

[release-5.7] Fluentd wrongly closes a log file due to hash collision


    • Release Note Text: Before this change, Fluentd could incorrectly close files that it should still be watching because of the way files were tracked. This change fixes that behavior so that it no longer relies strictly on the full path when determining which files it is watching.
    • Release Note Type: Bug Fix

      Description of problem:

      Fluentd manages its watch list with hash values generated from the path of each log file, but those hash values are no longer unique once log rotation happens.

      The hash values must be generated from inode numbers when follow_inodes is true.
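
      As a rough illustration of why path-based keys collide after rotation while inode-based keys do not, here is a minimal local sketch (plain bash, not Fluentd code; the /tmp/demo path is made up):

      $ bash -c '
          declare -A by_path by_inode
          log=/tmp/demo/0.log
          mkdir -p /tmp/demo && : > "$log"
          # First registration of the file under both keying schemes.
          by_path["$log"]=watching
          by_inode["$(stat -c %i "$log")"]=watching
          # Simulate rotation: the old file is renamed, a new file appears at the same path.
          mv "$log" "$log.rotated" && : > "$log"
          by_path["$log"]=watching                      # same key: the previous entry is clobbered
          by_inode["$(stat -c %i "$log")"]=watching     # new inode: both files stay tracked
          echo "path keys:  ${#by_path[@]}"             # prints 1
          echo "inode keys: ${#by_inode[@]}"            # prints 2
        '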

      Version-Release number of selected component (if applicable):

      OpenShift v4.12.0 with Red Hat OpenShift Logging v5.6.0

      How reproducible:

      It happens intermittently, when a rotated file is closed after the rotate_wait interval elapses.
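
      To confirm how the collector's in_tail source is configured before reproducing, one option (this assumes the generated Fluentd configuration is readable under /etc/fluent inside the collector container; the exact path can differ between releases) is:

      $ oc exec -n openshift-logging \
          $(oc get pods -n openshift-logging -l component=collector -o name | head -n 1) \
          -c collector -- grep -rn -e follow_inodes -e rotate_wait /etc/fluent/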

      Steps to Reproduce:

      Step1. Install OpenShift Logging and deploy the EFK stack on your cluster
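
      Before continuing, it is worth confirming that the collector pods are running; the NODE column of this output is also a convenient way to pick a node for Step2:

      $ oc get pods -n openshift-logging -l component=collector -o wide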

      Step2. Pick one of the OCP nodes to use for this test; the scripts below refer to it as $NODE
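
      For example (the node name below is just a placeholder):

      $ NODE=<node-name-you-picked>
      $ oc get node $NODE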

      Step3. Recreate the collector and node-ca pods so they start from a fresh state.

      $ oc delete pods -n openshift-image-registry -l name=node-ca
      $ oc delete pods -n openshift-logging -l component=collector
      
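      Before moving on, you may want to wait until the recreated pods are Ready again, for example:

      $ oc wait --for=condition=Ready pod -l name=node-ca -n openshift-image-registry --timeout=300s
      $ oc wait --for=condition=Ready pod -l component=collector -n openshift-logging --timeout=300s
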

      Step4. Run the following script in your terminal. It watches the Fluentd pos file entries and open file descriptors for the node-ca container log, and exits with "The issue got reproduced." once Fluentd no longer has any node-ca log file open.

      $ COLLECTOR=$(oc get pods -n openshift-logging -o wide | grep -w $NODE | awk '{print $1}')
      $ POD=$(oc get pods -n openshift-image-registry -o wide | grep node-ca | grep -w $NODE | awk '{print $1}')
      $ NAMESPACE=openshift-image-registry
      $ CONTAINER=node-ca
      
      $ oc exec -n openshift-logging $COLLECTOR -c collector -- \
         bash -c 'pre_inode=0;
                  while :;
                  do
                    date;
                    sleep 5 &
                    SLEEP_PID=$!
                    echo "-- Pos file";
                    grep -E "/var/log/pods/'"$NAMESPACE"'_'"$POD"'_.*/'"$CONTAINER"'/" /var/lib/fluentd/pos/es-containers.log.pos \
                      | awk "{printf(\"%s\t%s\t%s\n\", \$1, strtonum(\"0x\"\$2), strtonum(\"0x\"\$3))}";
                    echo "-- Ls";
                    ls -lditr /var/log/pods/'"$NAMESPACE"'_'"$POD"'_*/'"$CONTAINER"'/*;
                    echo "-- Lsof";
                    lsof -p 1 -nP | grep "/var/log/pods/'"$NAMESPACE"'_'"$POD"'_.*/'"$CONTAINER"'/" > /tmp/lsof.txt
                    if [ $(cat /tmp/lsof.txt | wc -l ) -eq 0 ]
                    then
                       echo "The issue got reproduced." 
                       kill $SLEEP_PID
                       break
                    fi
                    cat /tmp/lsof.txt
                    read inode path < <(ls -itrd /var/log/pods/'"$NAMESPACE"'_'"$POD"'_*/'"$CONTAINER"'/*.log.????????-?????? | tail -n 1);
                    if [ "$inode" != "" ] && [ "$inode" != "$pre_inode" ]
                    then
                        pre_inode="$inode";
                        echo "--- New log rotate detected: $path";
                    fi;
                    echo; wait
                  done' | tee pos_watch_${COLLECTOR}.txt && pkill -f "$POD"
      

      Step5. Run the following script in another terminal. It floods the node-ca container's stderr with large log lines so that the container log is rotated frequently.

      oc exec -it -n openshift-image-registry $POD -- bash -c '
        count=0;
        while :;
        do
          DATE="$(date)";
          for i in 1 2 3 4;
          do
            printf "$DATE: %3900d\n$DATE: %3900d\n$DATE: %3900d\n$DATE: %3900d\n$DATE: %3900d\n$DATE: %3900d\n" \
                   "$count" "$count" "$count" "$count" "$count" "$count" > /proc/1/fd/2 &
          done;
          ((count=count+1));
          echo "$count times" 
          if [ $(expr $count % 2000) -eq 0 ];
          then
            echo "sleep 60s...";
            sleep 60 & wait;
          fi;
          wait;
        done'; echo "Kill remaining processes"; oc exec -it -n openshift-image-registry $POD -- pkill -f DATE
      
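      While the flood script is running, you can check from the collector pod that the node-ca container log keeps being rotated (new 0.log.<timestamp> files should keep appearing), reusing the variables defined in Step4:

      $ oc exec -n openshift-logging $COLLECTOR -c collector -- \
          sh -c 'ls -ltr /var/log/pods/openshift-image-registry_node-ca-*/node-ca/'
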

      Step6. Wait until the issue is reproduced (it might take 10-20 minutes)

      Actual results:

      The first script exits with the "The issue got reproduced." message.

      Mon Feb 13 10:08:34 UTC 2023
      -- Pos file
      /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log 18446744073709551616    195064732
      /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log 18446744073709551616    195064733
      /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log 58444023        195064732
      /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log 35228450        195064763
      -- Ls
      195064725 -rw-r--r--. 1 root root   231685 Feb 13 10:06 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100423.gz
      195064724 -rw-r--r--. 1 root root   216644 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100625.gz
      195064729 -rw-r--r--. 1 root root   225105 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100716.gz
      195064732 -rw-------. 1 root root 58444023 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100758
      195064763 -rw-------. 1 root root 39091784 Feb 13 10:08 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log
      -- Lsof
      The issue got reproduced.
      
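      For reference, the offsets printed as 18446744073709551616 above are awk's rendering of the raw value ffffffffffffffff (strtonum rounds it at the 64-bit boundary); Fluentd writes that value into the pos file when it unwatches an entry. The raw entries can be inspected directly with:

      $ oc exec -n openshift-logging $COLLECTOR -c collector -- \
          grep node-ca /var/lib/fluentd/pos/es-containers.log.pos
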

      After that, you can check that Fluentd has already closed /var/log/pods/.../node-ca/0.log even though the log file still exists.

      $ oc exec -it -n openshift-logging $COLLECTOR -c collector -- bash -c "ls -li /var/log/pods/openshift-image-registry_node-ca-*/node-ca/*" 
      195064763 -rw-------. 1 root root 39092337 Feb 13 10:08 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log
      195064725 -rw-r--r--. 1 root root   231685 Feb 13 10:06 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100423.gz
      195064724 -rw-r--r--. 1 root root   216644 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100625.gz
      195064729 -rw-r--r--. 1 root root   225105 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100716.gz
      195064732 -rw-------. 1 root root 58444023 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100758
      
      $ oc exec -it -n openshift-logging $COLLECTOR -c collector -- lsof -nP -p 1 | grep node-ca || echo "Not Found" 
      Not Found
      

      Expected results:

      Fluentd does not close /var/log/pods/.../node-ca/0.log, since the file still exists.
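
      With the fix in place, the lsof check used above should keep listing the open 0.log instead of printing "Not Found", e.g.:

      $ oc exec -it -n openshift-logging $COLLECTOR -c collector -- lsof -nP -p 1 | grep "node-ca.*/0.log" || echo "Not Found"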

            Jeffrey Cantrill (jcantril@redhat.com)
            Kabir Bharti (rhn-support-kbharti)