Bug
Resolution: Done
Blocker
Logging 5.7.0, Logging 5.8.0
False
None
False
NEW
VERIFIED
Before this change, fluentd could incorrectly close files that it should be watching because of the way files were tracked. This change fixes that behavior so that fluentd no longer relies strictly on the full path when determining which files it is watching.
Bug Fix
Proposed
Proposed
Description of problem:
Fluentd manages its watch list with hash values generated from the path of each log file.
But those hash values become non-unique when log rotation happens.
The hash values must be generated from inode numbers when follow_inodes is true.
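For illustration, here is a minimal bash sketch (hypothetical file names, not fluentd code; assumes bash 4+ and GNU coreutils stat) of why path-based keys collide across a rotation while inode-based keys stay unique:
cd "$(mktemp -d)"
declare -A by_path by_inode
track() {
  local f=$1 ino
  ino=$(stat -c %i "$f")   # the inode number survives renames
  by_path["$f"]=$ino       # path key: a new file at the same path overwrites
                           # the entry that belonged to the rotated file
  by_inode["$ino"]=$f      # inode key: rotated and new file stay distinct
}
touch 0.log; track 0.log             # original log file
mv 0.log 0.log.20230213-100758       # rotation renames it away
touch 0.log; track 0.log             # same path again, but a new inode
echo "path keys: ${#by_path[@]}, inode keys: ${#by_inode[@]}"   # -> path keys: 1, inode keys: 2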
Version-Release number of selected component (if applicable):
OpenShift v4.12.0 with Red Hat OpenShift Logging v5.6.0
How reproducible:
It sometimes happens when rotated files are closed after the rotate_wait interval expires.
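For context, follow_inodes and rotate_wait are standard in_tail parameters; a minimal sketch of the relevant source configuration (the pos_file path is taken from this report, other values are illustrative):
<source>
  @type tail
  path "/var/log/pods/**/*.log"
  pos_file "/var/lib/fluentd/pos/es-containers.log.pos"
  follow_inodes true  # track files by inode so rotated files are not lost
  rotate_wait 5       # seconds to keep a rotated file open before closing it
</source>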
Steps to Reproduce:
Step1. Install OpenShift Logging, then deploy the EFK stack on your cluster.
Step2. Pick one of the OCP nodes to use for this test.
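The script in Step4 references the chosen node as $NODE, so set it first (the value below is a placeholder; substitute your node's name):
$ NODE=<node-name>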
Step3. Recreate the collector pods and node-ca pods so they start fresh.
$ oc delete pods -n openshift-image-registry -l name=node-ca
$ oc delete pods -n openshift-logging -l component=collector
Step4. Run the following script on your terminal
$ COLLECTOR=$(oc get pods -n openshift-logging -o wide | grep -w $NODE | awk '{print $1}')
$ POD=$(oc get pods -n openshift-image-registry -o wide | grep node-ca | grep -w $NODE | awk '{print $1}')
$ NAMESPACE=openshift-image-registry
$ CONTAINER=node-ca
$ oc exec -n openshift-logging $COLLECTOR -c collector -- \
  bash -c 'pre_inode=0;
  while :; do
    date;
    sleep 5 & SLEEP_PID=$!
    # dump the pos-file entries for the node-ca container (hex fields decoded to decimal)
    echo "-- Pos file";
    grep -E "/var/log/pods/'"$NAMESPACE"'_'"$POD"'_.*/'"$CONTAINER"'/" /var/lib/fluentd/pos/es-containers.log.pos \
      | awk "{printf(\"%s\t%s\t%s\n\", \$1, strtonum(\"0x\"\$2), strtonum(\"0x\"\$3))}";
    # list the log files with their inode numbers
    echo "-- Ls";
    ls -lditr /var/log/pods/'"$NAMESPACE"'_'"$POD"'_*/'"$CONTAINER"'/*;
    # list the log files fluentd (pid 1) actually holds open
    echo "-- Lsof";
    lsof -p 1 -nP | grep "/var/log/pods/'"$NAMESPACE"'_'"$POD"'_.*/'"$CONTAINER"'/" > /tmp/lsof.txt
    # no open file descriptor left means fluentd closed a log it should still watch
    if [ $(cat /tmp/lsof.txt | wc -l) -eq 0 ]
    then
      echo "The issue got reproduced."
      kill $SLEEP_PID
      break
    fi
    cat /tmp/lsof.txt
    # remember the newest rotated file so each rotation is reported once
    read inode path < <(ls -itrd /var/log/pods/'"$NAMESPACE"'_'"$POD"'_*/'"$CONTAINER"'/*.log.????????-?????? | tail -n 1);
    if [ "$inode" != "" ] && [ "$inode" != "$pre_inode" ]
    then
      pre_inode="$inode";
      echo "--- New log rotate detected: $path";
    fi;
    echo;
    wait
  done' && pkill -f "$POD" | tee pos_watch_${COLLECTOR}.txt
Step5. Run the following script on another terminal
oc exec -it -n openshift-image-registry $POD -- bash -c '
  count=0;
  while :; do
    DATE="$(date)";
    # write large entries to the stderr of pid 1, which end up in the container log and force rotation
    for i in 1 2 3 4; do
      printf "$DATE: %3900d\n$DATE: %3900d\n$DATE: %3900d\n$DATE: %3900d\n$DATE: %3900d\n$DATE: %3900d\n" \
        "$count" "$count" "$count" "$count" "$count" "$count" > /proc/1/fd/2 &
    done;
    ((count=count+1));
    echo "$count times"
    # every 2000 iterations, pause for 60s
    if [ $(expr $count % 2000) -eq 0 ]; then
      echo "sleep 60s...";
      sleep 60 & wait;
    fi;
    wait;
  done';
echo "Kill remaining processes";
oc exec -it -n openshift-image-registry $POD -- pkill -f DATE
Step6. Wait until the issue is reproduced (it might take 10-20 minutes).
Actual results:
The first script exits with the "The issue got reproduced." message:
Mon Feb 13 10:08:34 UTC 2023
-- Pos file
/var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log 18446744073709551616 195064732
/var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log 18446744073709551616 195064733
/var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log 58444023 195064732
/var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log 35228450 195064763
-- Ls
195064725 -rw-r--r--. 1 root root 231685 Feb 13 10:06 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100423.gz
195064724 -rw-r--r--. 1 root root 216644 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100625.gz
195064729 -rw-r--r--. 1 root root 225105 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100716.gz
195064732 -rw-------. 1 root root 58444023 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100758
195064763 -rw-------. 1 root root 39091784 Feb 13 10:08 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log
-- Lsof
The issue got reproduced.
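Each pos-file line above is "<path> <position in hex> <inode in hex>"; the script's awk converts the two hex fields to decimal. A position printing as 18446744073709551616 corresponds to 0xffffffffffffffff, the sentinel value fluentd writes for entries it has unwatched. For reference, the same decoding can be run standalone (assumes the image's awk is gawk, whose strtonum extension the monitoring script already relies on):
$ oc exec -n openshift-logging $COLLECTOR -c collector -- \
    awk '{printf("%s pos=%d inode=%d\n", $1, strtonum("0x"$2), strtonum("0x"$3))}' \
    /var/lib/fluentd/pos/es-containers.log.pos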
After that, you can confirm that fluentd has already closed /var/log/pods/.../node-ca/0.log even though the log file still exists:
$ oc exec -it -n openshift-logging $COLLECTOR -c collector -- bash -c "ls -li /var/log/pods/openshift-image-registry_node-ca-*/node-ca/*"
195064763 -rw-------. 1 root root 39092337 Feb 13 10:08 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log
195064725 -rw-r--r--. 1 root root 231685 Feb 13 10:06 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100423.gz
195064724 -rw-r--r--. 1 root root 216644 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100625.gz
195064729 -rw-r--r--. 1 root root 225105 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100716.gz
195064732 -rw-------. 1 root root 58444023 Feb 13 10:07 /var/log/pods/openshift-image-registry_node-ca-<xxxxx>/node-ca/0.log.20230213-100758
$ oc exec -it -n openshift-logging $COLLECTOR -c collector -- lsof -nP -p 1 | grep node-ca || echo "Not Found"
Not Found
Expected results:
Fluentd doesn't close /var/log/pods/.../node-ca/0.log since it still exists.
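With the fix in place, the same lsof check from the reproduction should keep finding an open descriptor for 0.log instead of printing "Not Found" (a verification sketch reusing $COLLECTOR from Step4; the exact lsof columns will vary):
$ oc exec -it -n openshift-logging $COLLECTOR -c collector -- lsof -nP -p 1 | grep node-ca || echo "Not Found"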