OpenShift Logging / LOG-6968

Loki ingester stops sending heartbeats after communication has been disrupted for a while


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: Logging 6.4.0
    • Affects Version/s: Logging 6.2.0, Logging 6.0.6, Logging 6.1.4, Logging 6.3.0
    • Component/s: Log Storage
    • Incidents & Support
    • Release Note Type: Bug Fix
    • Sprint: Log Storage - Sprint 269, Logging - Sprint 279

      Description of problem:

      Loki's components use a gossip protocol ("memberlist") to keep track of each other's availability. This functionality is used, for example, to identify which of the ingesters are reachable.

      To keep track of reachability, each component sends "heartbeat" messages to the other components, which then update a timestamp indicating when the last heartbeat was received. Once this timestamp is older than the "heartbeat timeout", the component is marked as "UNHEALTHY" and is no longer considered active.
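
      For reference, the ring and memberlist state (each member's status and the age of its last heartbeat) can be inspected through the ingester's HTTP status pages. A minimal sketch, assuming the example LokiStack from the "Additional info" section below (pod name lokistack-dev-ingester-0, HTTP port 3100); if the stack serves its HTTP endpoints over TLS, switch to https and skip certificate verification as needed:

      # Forward the ingester's HTTP port to the local machine
      oc -n openshift-logging port-forward pod/lokistack-dev-ingester-0 3100:3100 &
      # Ingester ring status page: lists each ingester with its state (ACTIVE/UNHEALTHY) and last heartbeat
      curl -s http://localhost:3100/ring
      # Memberlist status page: lists the gossip cluster members as seen by this ingester
      curl -s http://localhost:3100/memberlist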

      Components should continue to send heartbeat messages to unhealthy nodes, so that it's possible for the system to automatically recover from a disruption.

      This mechanism seems to work for a while after communication has been disrupted, but we have observed that the ingesters stop sending heartbeat messages to other ingesters marked "UNHEALTHY" roughly 5-6 minutes after the node has been marked as unhealthy.

      Because this happens on both sides, the memberlist cannot recover automatically once a connectivity issue persists for too long.

      Version-Release number of selected component (if applicable):

      Loki Operator 6.2.0

      How reproducible:

      Steps to Reproduce:

      1. Create a LokiStack with three ingesters
      2. Disrupt the network communication for one of the ingesters
      3. (depending on how the network was disrupted, the ingester may now log error messages indicating network problems when sending heartbeats)
      4. Wait until the heartbeat timeout elapses (1 minute)
      5. See the ingester set to UNHEALTHY
      6. Wait another 5-6 minutes
      7. (the error messages about failed heartbeats should now stop; see the log-watch sketch after this list)
      8. Remove the network disruption
      9. The ingester remains in the UNHEALTHY state even though network communication is working again
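
      To observe steps 3 and 7, the heartbeat errors can be followed in the logs of one of the unaffected ingesters. A minimal sketch, assuming the example LokiStack from the "Additional info" section below (pod names lokistack-dev-ingester-*); the exact wording of the messages varies between Loki versions:

      # Follow a healthy ingester's logs and filter for memberlist/heartbeat-related messages
      oc -n openshift-logging logs -f lokistack-dev-ingester-1 | grep -iE 'memberlist|heartbeat'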

      Actual results:

      The memberlist does not return to a normal state even after the network disruption has been removed.

      Expected results:

      The memberlist should be able to recover automatically from a network communication issue, even one that persists for more than a few minutes.

      Additional info:

       Example LokiStack:

      apiVersion: loki.grafana.com/v1
      kind: LokiStack
      metadata:
        name: lokistack-dev
        namespace: openshift-logging
      spec:
        limits:
          global:
            retention:
              days: 30
        size: 1x.demo
        storage:
          schemas:
          - version: v13
            effectiveDate: "2024-06-01"
          secret:
            name: test
            type: s3
        template:
          ingester:
            replicas: 3
        storageClassName: gp3-csi
        tenants:
          mode: openshift-logging
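
      A minimal way to create it, assuming the Loki Operator is already installed and the object storage secret "test" exists in openshift-logging (the file name is arbitrary):

      oc apply -f lokistack-dev.yaml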
      

      Example command to disrupt network communication:

      # Get a shell on the node where the ingester is running
      oc debug node/<name>
      chroot /host
      # Identify the ingester PIDs
      ps axfu | grep -- -target=ingester
      # Select one of the PIDs and inject an iptables rule into its network namespace,
      # dropping outgoing memberlist (gossip) traffic on port 7946
      nsenter -t <pid> -n iptables -I OUTPUT -p tcp -m tcp --dport 7946 -j DROP
      # Remove the rule to allow communication again
      nsenter -t <pid> -n iptables -D OUTPUT -p tcp -m tcp --dport 7946 -j DROP
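
      A small follow-up sketch to confirm whether the injected rule is currently present in the ingester's network namespace (same <pid> as above):

      # List the OUTPUT rules in the pod's network namespace and look for the gossip port
      nsenter -t <pid> -n iptables -L OUTPUT -n --line-numbers | grep 7946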
      

      Assignee: Robert Jacob (rojacob@redhat.com)
      Reporter: Robert Jacob (rojacob@redhat.com)
      Votes: 1
      Watchers: 4
      Created:
      Updated: