- Bug
- Resolution: Unresolved
- Incidents & Support
- NEW
- Bug Fix
Description of problem:
This issue is similar to LOG-7502, but the previous fix did not cover all cases. When the TCP session to the syslog server is killed, log forwarding stops for about 15 minutes. This happens at least in configurations where the syslog server uses active/standby nodes behind a load balancer. The underlying problem is that detecting the broken connection and recovering from it take a long time, which makes monitoring and troubleshooting difficult. The roughly 15-minute gap likely corresponds to the Linux default TCP retransmission limit (net.ipv4.tcp_retries2 = 15, about 924 seconds of retransmissions before the kernel aborts the connection).
Version-Release number of selected component (if applicable):
All recent Logging versions that use Vector with the syslog output (TCP socket sink)
How reproducible:
Always
Steps to Reproduce:
- Configure a syslog server with two nodes in active/standby mode
- Create a Kubernetes Service or external load balancer to route traffic to the syslog servers
- Configure ClusterLogForwarder to send logs to an external syslog server using TCP (minimal Service and ClusterLogForwarder sketches follow this list)
- Deploy an application that generates logs every second
- Confirm logs are forwarded to the syslog server
- Kill the TCP session by shutting down the active syslog server node
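To make the reproduction concrete, the sketches below show the shape of the configuration used; they are illustrative only. The names, namespace, selector, port, and URL are placeholders, and the ClusterLogForwarder example assumes the observability.openshift.io/v1 API, so field names may differ slightly between Logging releases.

```yaml
# Hypothetical Service fronting the active/standby syslog nodes
apiVersion: v1
kind: Service
metadata:
  name: syslog-lb
  namespace: syslog
spec:
  selector:
    app: syslog           # matches both the active and the standby node
  ports:
    - name: syslog-tcp
      protocol: TCP
      port: 514
      targetPort: 514
```

```yaml
# Hypothetical ClusterLogForwarder forwarding application logs to the syslog Service over TCP
apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: collector
  namespace: openshift-logging
spec:
  serviceAccount:
    name: collector        # placeholder service account with log collection permissions
  outputs:
    - name: external-syslog
      type: syslog
      syslog:
        url: tcp://syslog-lb.syslog.svc:514
  pipelines:
    - name: app-to-syslog
      inputRefs:
        - application
      outputRefs:
        - external-syslog
```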
Actual results:
- Log forwarding stops after the TCP session is killed.
- Recovery happens only after OS TCP timeout (~15 minutes) or collector pod restart.
Expected results:
- Vector should detect the broken TCP session quickly and reconnect to the syslog server.
- Log forwarding should resume automatically without manual pod restart.
Additional info:
- We also tested with a Kubernetes Service acting as a load balancer for the syslog servers in an active/standby configuration, and the same issue occurred.
- With an active connection, the keepalive.time_secs option introduced in LOG-7502 did not help (a sketch of the relevant Vector sink settings follows).
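For reference, keepalive.time_secs maps to the keepalive setting of Vector's TCP socket sink roughly as sketched below. This is an illustrative, hand-written sink definition rather than the exact configuration generated by the collector; the sink name, input name, address, and the 60-second value are assumptions.

```yaml
# Hypothetical Vector TCP socket sink showing where keepalive.time_secs applies
sinks:
  external_syslog:
    type: socket
    inputs:
      - application_logs   # placeholder input
    mode: tcp
    address: syslog-lb.syslog.svc:514
    encoding:
      codec: text
    keepalive:
      time_secs: 60        # TCP keepalive idle time; did not resolve the issue in this scenario
```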