  1. OpenShift Logging
  2. LOG-8345

[Logging6.4][Vector] Log forwarding stops for about 15 minutes after TCP session is killed


    • Type: Bug
    • Resolution: Unresolved
    • Log Collection
    • Incidents & Support
    • NEW
    • Bug Fix

      Description of problem:

      This issue is similar to LOG-7502, but the previous fix did not cover all cases. When the TCP session to the syslog server is killed, log forwarding stops for about 15 minutes. This happens at least in configurations where the syslog server runs as active/standby nodes behind a load balancer. Detecting the broken connection and recovering from it take a long time, which makes monitoring and troubleshooting difficult.

      Version-Release number of selected component (if applicable):

      All current Logging versions that use Vector with the syslog output (socket sink over TCP)

      How reproducible:

      Always

      Steps to Reproduce:

      1. Configure a syslog server with two nodes in active/standby mode
      2. Create a Kubernetes Service or external load balancer to route traffic to the syslog servers
      3. Configure ClusterLogForwarder to send logs to an external syslog server using TCP
      4. Deploy an application that generates logs every second
      5. Confirm logs are forwarded to the syslog server
      6. Kill the TCP session by shutting down the active syslog server node
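To quantify the outage after step 6, one option is to compare the timestamps of the log lines that actually reached the syslog server. The following is a minimal sketch, not part of the original report: `find_gaps` and the 5-second threshold are hypothetical, and it assumes the timestamps have already been parsed from the received messages.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, threshold=timedelta(seconds=5)):
    """Return (start, end, duration) for every gap longer than threshold.

    `timestamps` are the times of log lines received by the syslog server.
    With an application logging once per second, any gap well above one
    second indicates that forwarding stalled.
    """
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > threshold:
            gaps.append((prev, cur, cur - prev))
    return gaps


if __name__ == "__main__":
    # Synthetic data (assumption): one line per second for a minute,
    # a stall, then one line per second for another minute.
    start = datetime(2024, 1, 1, 12, 0, 0)
    received = [start + timedelta(seconds=s) for s in range(60)]
    received += [start + timedelta(seconds=960 + s) for s in range(60)]
    for gap_start, gap_end, duration in find_gaps(received):
        print(f"no logs between {gap_start} and {gap_end} ({duration})")
```

Running this against real server-side timestamps before and after shutting down the active node would show whether the stall is close to the 15 minutes reported here.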

      Actual results:

      • Log forwarding stops after the TCP session is killed.
      • Recovery happens only after the OS TCP timeout (~15 minutes) or a collector pod restart.
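The ~15 minute figure is consistent with the Linux default retransmission limit, although this report does not confirm that as the cause. As a rough sketch under that assumption: with `net.ipv4.tcp_retries2 = 15`, a minimum RTO of 200 ms, and a maximum RTO of 120 s, the kernel documentation quotes a hypothetical timeout of 924.6 seconds. The function name below is illustrative, and the real value depends on the measured round-trip time.

```python
def retransmission_timeout(retries=15, rto_min=0.2, rto_max=120.0):
    """Estimate how long Linux keeps retransmitting before giving up.

    Assumes the RTO starts at rto_min and doubles after every timer
    expiry, capped at rto_max. The connection is dropped once the timer
    has expired retries + 1 times.
    """
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += min(rto, rto_max)
        rto *= 2
    return total


if __name__ == "__main__":
    seconds = retransmission_timeout()
    print(f"{seconds:.1f} s (~{seconds / 60:.1f} min)")
```

Under these assumptions the total is about 924.6 seconds, or roughly 15.4 minutes, which matches the recovery delay described above.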

      Expected results:

      • Vector should detect the broken TCP session quickly and reconnect to the syslog server.
      • Log forwarding should resume automatically without a manual pod restart.

      Additional info:

      • We also tested with a Kubernetes Service acting as a load balancer for syslog servers in active/standby configuration, and the same issue occurred.
      • When the connection is active (logs are still being sent), the option introduced in LOG-7502 (keepalive.time_secs) did not help.
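One likely explanation, offered here as background rather than as a confirmed root cause: TCP keepalive probes are only sent on an idle connection. When data is in flight and unacknowledged, the retransmission timer governs how long the kernel waits before declaring the connection dead. On Linux, the `TCP_USER_TIMEOUT` socket option (RFC 5482) bounds that wait. The sketch below shows both settings on a plain Python socket. It is not Vector's implementation, the report does not say whether Vector's socket sink exposes such an option, and the function name and default values are hypothetical.

```python
import socket


def configure_failure_detection(sock, idle_secs=60, interval_secs=10,
                                probes=3, user_timeout_ms=30000):
    """Enable keepalive and, where available, a user timeout on sock.

    Keepalive covers idle connections; TCP_USER_TIMEOUT covers
    connections with unacknowledged data. Options missing on the current
    platform are skipped. Returns the options that were applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {"SO_KEEPALIVE": 1}
    wanted = (
        ("TCP_KEEPIDLE", idle_secs),             # idle time before first probe
        ("TCP_KEEPINTVL", interval_secs),        # interval between probes
        ("TCP_KEEPCNT", probes),                 # failed probes before reset
        ("TCP_USER_TIMEOUT", user_timeout_ms),   # max time data may stay unacked
    )
    for name, value in wanted:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied[name] = value
    return applied


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print(configure_failure_detection(s))
```

With a user timeout of 30 seconds, a sender that keeps writing to a dead peer would see the connection fail after about 30 seconds instead of waiting for the full retransmission backoff, at which point it could reconnect.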

              Unassigned Unassigned
              kkawakam@redhat.com KATSUYA KAWAKAMI