OpenShift Logging / LOG-6968

Loki ingester stops sending heartbeats after communication has been disrupted for a while


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: Logging 6.4.0
    • Affects Version/s: Logging 6.2.0, Logging 6.0.6, Logging 6.1.4, Logging 6.3.0
    • Component/s: Log Storage
    • Incidents & Support
    • Release Note Type: Bug Fix
    • Sprint: Log Storage - Sprint 269, Logging - Sprint 279

      Description of problem:

      Loki's components use a gossip protocol ("memberlist") to keep track of each other's availability. This functionality is used, for example, to identify which of the ingesters are reachable.

      To keep track of reachability, each component sends "heartbeat" messages to the other components, which then update a timestamp indicating when the last heartbeat was received. Once this timestamp is older than the "heartbeat timeout", the component is marked as "UNHEALTHY" and is no longer considered active.
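
      For reference, the ring and memberlist state (each member's status and the age of its last heartbeat) can be inspected through the ingester's HTTP status pages. A minimal sketch, assuming the example LokiStack from the "Additional info" section below (pod name lokistack-dev-ingester-0, HTTP port 3100); if the stack serves its HTTP endpoints over TLS, switch to https and skip certificate verification as needed:

      # Forward the ingester's HTTP port to the local machine
      oc -n openshift-logging port-forward pod/lokistack-dev-ingester-0 3100:3100 &
      # Ingester ring status page: lists each ingester with its state (ACTIVE/UNHEALTHY) and last heartbeat
      curl -s http://localhost:3100/ring
      # Memberlist status page: lists the gossip cluster members as seen by this ingester
      curl -s http://localhost:3100/memberlist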

      Components should continue to send heartbeat messages to unhealthy nodes, so that it's possible for the system to automatically recover from a disruption.

      This mechanism seems to work for a while after communication has been disrupted, but we have observed that the ingesters stop sending heartbeat messages to other ingesters marked "UNHEALTHY" roughly 5-6 minutes after the node has been marked as unhealthy.

      Because this happens on both sides, the memberlist cannot recover automatically once a connectivity issue persists for too long.

      Version-Release number of selected component (if applicable):

      Loki Operator 6.2.0

      How reproducible:

      Steps to Reproduce:

      1. Create a LokiStack with three ingesters
      2. Disrupt the network communication for one of the ingesters
      3. (depending on how the network was disrupted, the ingester may now log error messages indicating network problems when sending heartbeats)
      4. Wait until the heartbeat timeout elapses (1 minute)
      5. See the ingester set to UNHEALTHY
      6. Wait another 5-6 minutes
      7. (the error messages about failed heartbeats should now stop; see the log-watch sketch after this list)
      8. Remove the network disruption
      9. The ingester remains in the UNHEALTHY state even though network communication is working again
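
      To observe steps 3 and 7, the heartbeat errors can be followed in the logs of one of the unaffected ingesters. A minimal sketch, assuming the example LokiStack from the "Additional info" section below (pod names lokistack-dev-ingester-*); the exact wording of the messages varies between Loki versions:

      # Follow a healthy ingester's logs and filter for memberlist/heartbeat-related messages
      oc -n openshift-logging logs -f lokistack-dev-ingester-1 | grep -iE 'memberlist|heartbeat'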

      Actual results:

      The memberlist does not return to a normal state even after the network disruption has been removed.

      Expected results:

      The memberlist should be able to recover automatically from a network communication issue, even one that persists for more than a few minutes.

      Additional info:

       Example LokiStack:

      apiVersion: loki.grafana.com/v1
      kind: LokiStack
      metadata:
        name: lokistack-dev
        namespace: openshift-logging
      spec:
        limits:
          global:
            retention:
              days: 30
        size: 1x.demo
        storage:
          schemas:
          - version: v13
            effectiveDate: "2024-06-01"
          secret:
            name: test
            type: s3
        template:
          ingester:
            replicas: 3
        storageClassName: gp3-csi
        tenants:
          mode: openshift-logging
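
      A minimal way to create it, assuming the Loki Operator is already installed and the object storage secret "test" exists in openshift-logging (the file name is arbitrary):

      oc apply -f lokistack-dev.yaml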
      

      Example command to disrupt network communication:

      # Get a shell on the node where the ingester is running
      oc debug node/<name>
      chroot /host
      # Identify the ingester PIDs
      ps axfu | grep -- -target=ingester
      # Select one of the PIDs and inject an iptables rule into its network namespace,
      # dropping outgoing memberlist (gossip) traffic on port 7946
      nsenter -t <pid> -n iptables -I OUTPUT -p tcp -m tcp --dport 7946 -j DROP
      # Remove the rule to allow communication again
      nsenter -t <pid> -n iptables -D OUTPUT -p tcp -m tcp --dport 7946 -j DROP
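
      A small follow-up sketch to confirm whether the injected rule is currently present in the ingester's network namespace (same <pid> as above):

      # List the OUTPUT rules in the pod's network namespace and look for the gossip port
      nsenter -t <pid> -n iptables -L OUTPUT -n --line-numbers | grep 7946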
      

      Assignee: Robert Jacob (rojacob@redhat.com)
      Reporter: Robert Jacob (rojacob@redhat.com)
      Votes: 1
      Watchers: 4
      Created:
      Updated: