[LOG-6348] Collector startup script removes buffer lock files breaking locking code of vector - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: Logging 6.0.4
Affects Version/s: Logging 6.0.1
Component/s: Log Collection
Labels:
- devel_ack+

Blocked:
False
Blocked Reason:
None
Ready:
False
Docs QE Status:
NEW
QE Status:
NEW
Release Note Text:
Before this fix, upon Vector startup, the startup script attempted to delete buffer lock files. With this fix, that step is removed.
Release Note Type:
Bug Fix

Sprint:
Log Collection - Sprint 262, Log Collection - Sprint 263, Log Collection - Sprint 264, Log Collection - Sprint 265, Log Collection - Sprint 266

Description of problem:

The version of Vector distributed by the Cluster Logging Operator is launched by a startup script that, among other things, removes the lock files Vector uses to prevent multiple instances from accessing the same buffer files. The startup script tries to work around this by delaying the start of Vector with a fixed delay, currently ten seconds.

Deleting the lock files prevents Vector from detecting whether other instances are running, and may cause data corruption if two instances of Vector succeed in writing buffer files at the same time.
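The failure mode can be illustrated with a minimal sketch. It assumes an advisory `flock`-style lock on a file in the buffer directory. The file name `buffer.lock` and the helper `try_lock` are hypothetical, and Vector's actual locking mechanism may differ. The point is only that a lock is tied to the file that was opened, so once the path is deleted, a second opener creates a fresh file and locks it without conflict.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Open (creating if needed) the lock file and try to take an exclusive,
    non-blocking advisory lock.

    Returns the open file descriptor on success, or None if the lock is
    already held through another open of the same file.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def demo():
    """Return (first_ok, second_ok, third_ok) for three lock attempts."""
    with tempfile.TemporaryDirectory() as data_dir:
        # Hypothetical lock file name, used for illustration only.
        lock_path = os.path.join(data_dir, "buffer.lock")

        first = try_lock(lock_path)   # first "instance" takes the lock
        second = try_lock(lock_path)  # same file: the lock is respected

        # Simulate the startup script deleting the lock file.
        os.remove(lock_path)

        # A new file is created at the same path, so this lock does not
        # conflict with the one still held by the first "instance".
        third = try_lock(lock_path)

        result = (first is not None, second is not None, third is not None)
        for fd in (first, second, third):
            if fd is not None:
                os.close(fd)
        return result


if __name__ == "__main__":
    print(demo())
```

In this sketch the second attempt is refused while the lock file is intact, but the third attempt succeeds after the file is deleted, even though the first holder has not released its lock. This requires a Unix-like system where `fcntl.flock` is available.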

Testing shows that this error condition can be reached in rare cases involving many redeployments of the Vector pods. It should also be possible to reproduce the issue by creating two sets of collector pods that internally use the same location for their buffer files. Such a configuration should not be possible without manually modifying the DaemonSet resources created by the Cluster Logging Operator.

It should be tested whether the part of the startup script that removes the buffer lock files can be dropped, so that Vector can use the locks to ensure that no two instances run at the same time. It may be possible to get rid of the startup script completely.

Version-Release number of selected component (if applicable):

OpenShift v4.16.8
cluster-logging.v6.0.0
loki-operator.v6.0.0

How reproducible:

Rare (it happened about once every few days with our reproduction steps)

Steps to Reproduce:

Actual results:

Vector pods started to crash when buffer files had been overwritten by another Vector instance.

Expected results:

Vector is able to guard itself against running multiple instances on the same storage location.

Additional information:

Preliminary tests that modify the startup script so that the buffer lock files stay intact look promising.
The first part of the startup script, which creates the data directory, seems to be unnecessary in the current configuration, because the directory is a VolumeMount in the container definition and therefore always exists.