OpenShift Logging / LOG-7751

[release-6.3] Vector stops forwarding logs when the TCP session is killed

    • Bug
    • Resolution: Done
    • Critical
    • Logging 6.3.2
    • Logging 5.8.z, Logging 5.9.z, Logging 6.0.z, Logging 6.1.z, Logging 6.2.z, Logging 6.3.z, Logging 6.4.z
    • Log Collection
    • Incidents & Support
    • False
    • False
    • NEW
    • VERIFIED
    • Before this fix, Vector could not recover from silently closed TCP connections. With this fix, Vector now uses keepalive probes to detect and automatically re-establish unresponsive TCP connections.
    • Bug Fix
    • Logging - Sprint 277, Logging - Sprint 278
    • Critical

      Description of problem:

      When log forwarding to syslog (socket sink) is configured and the TCP session is killed or dropped for any reason (firewall, load balancer, etc.), Vector reports no error, but logs are no longer sent. Vector does not resume forwarding until the collector pods are restarted.

      This issue is confirmed upstream in https://github.com/vectordotdev/vector/issues/4933.

      Version-Release number of selected component (if applicable):

      All Logging versions using Vector and syslog output (socket sink using TCP)

      How reproducible:

      Detailed in the upstream bug.

      Steps to Reproduce:

      1. The reproduction steps are detailed in the upstream issue: https://github.com/vectordotdev/vector/issues/4933#issuecomment-1185617943

      Actual results:

      Vector stops forwarding logs over a half-closed network connection and does not retry until the collector pods are restarted and new TCP connections are created.

      Expected results:

      Vector detects that the TCP connection no longer works and creates a new TCP connection.

      Additional info:

      Not tested, but the same problem should affect other sinks that use TCP, such as Elasticsearch, because Vector has not implemented TCP keepalive.
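
      As a rough illustration of why keepalive matters here, the sketch below estimates how long an idle, dead connection could go unnoticed once keepalive is enabled with an idle time of 60 seconds (the value the workaround sets). It assumes that only the idle time is configured and that the probe interval and probe count come from the kernel settings visible where the script runs, which may differ from those inside a collector pod. The function name `detection_secs` is hypothetical.

```shell
#!/usr/bin/env bash
# Rough estimate of how long an idle, dead TCP connection can go unnoticed.
# Assumption: only the idle time is set (keepalive.time_secs); the probe
# interval and probe count come from the kernel's settings.

# detection_secs IDLE INTERVAL PROBES -> prints IDLE + INTERVAL * PROBES
detection_secs() {
  local idle=$1 interval=$2 probes=$3
  echo $(( idle + interval * probes ))
}

# Read the kernel values if available; fall back to the usual Linux defaults.
interval=$(cat /proc/sys/net/ipv4/tcp_keepalive_intvl 2>/dev/null || echo 75)
probes=$(cat /proc/sys/net/ipv4/tcp_keepalive_probes 2>/dev/null || echo 9)

echo "estimated detection time: $(detection_secs 60 "$interval" "$probes") seconds"
```

      The estimate only applies to connections with no unacknowledged data in flight; a connection that is actively retransmitting is governed by the TCP retransmission timers instead.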

      Workaround

      Restart the collector pods to create new TCP connections, or enable TCP keepalive on the syslog sinks:

      1. Set the variables

          $ cr="collector"
          $ ns="openshift-logging"
      

      2. Set the ClusterLogForwarder CR to "Unmanaged"

          $ oc patch obsclf/$cr -n $ns --type=merge -p '{"spec":{"managementState": "Unmanaged"}}'
          clusterlogforwarder.observability.openshift.io/collector patched
      

      3. Extract the collector configmap. This extracts the files "run-vector.sh" and "vector.toml"

          $ mkdir config
          $ cd config/
          $ oc extract cm/$cr-config -n $ns
          run-vector.sh
          vector.toml
      

      4. Modify "vector.toml" to add a TCP keepalive setting to each syslog sink

          $ servers=$(oc get obsclf/$cr -n $ns -o jsonpath='{.spec.outputs[?(@.type=="syslog")].name}')
          $ for server in $(echo $servers | tr "-" "_"); do echo $server; sed -i "/sinks\.output_$server\]/a keepalive.time_secs = 60" vector.toml; done
      

      5. Delete the current Vector configuration

          $ oc delete cm $cr-config -n $ns 
      

      6. Recreate the configmap

          $ oc create configmap $cr-config --from-file=run-vector.sh --from-file=vector.toml -n $ns
      

      7. Restart the collector pods to load the new configuration

          $ oc delete pods -l app.kubernetes.io/component=collector -n $ns
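
      The sed edit in step 4 appends the keepalive line every time it runs, so running it twice would leave a duplicate key in "vector.toml". The following is a minimal sketch of a repeat-safe variant, assuming GNU sed and the `[sinks.output_<name>]` header format used in step 4. The function name `add_keepalive` and the sample config are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# add_keepalive FILE SINK...
# Adds "keepalive.time_secs = 60" right after each [sinks.output_<SINK>] header.
# Sinks whose header is already followed by a keepalive.time_secs line are
# skipped, so repeated runs do not duplicate the key.
add_keepalive() {
  local file=$1 sink
  shift
  for sink in "$@"; do
    if ! grep -q -F "[sinks.output_${sink}]" "$file"; then
      echo "output_${sink}: sink not found, skipping"
      continue
    fi
    if grep -A1 -F "[sinks.output_${sink}]" "$file" | grep -q '^keepalive\.time_secs'; then
      echo "output_${sink}: keepalive already set, skipping"
      continue
    fi
    sed -i "/sinks\.output_${sink}\]/a keepalive.time_secs = 60" "$file"
    echo "output_${sink}: keepalive added"
  done
}

# Demo on a made-up, heavily trimmed config (a real vector.toml is much larger).
demo=$(mktemp)
cat > "$demo" <<'EOF'
[sinks.output_my_syslog]
type = "socket"
mode = "tcp"
EOF

add_keepalive "$demo" my_syslog   # adds the line
add_keepalive "$demo" my_syslog   # second run changes nothing
cat "$demo"
```

      In the workaround, the function could be called as `add_keepalive vector.toml $(echo $servers | tr "-" "_")` in place of the loop in step 4.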

              rh-ee-calee Calvin Lee
              rhn-support-ocasalsa Oscar Casal Sanchez
              Kabir Bharti Kabir Bharti