Bug
Resolution: Unresolved
Major
Logging 6.2.0, Logging 6.0.6, Logging 6.1.4, Logging 6.3.0
Incidents & Support
Log Storage - Sprint 269, Logging - Sprint 279
Description of problem:
Loki's components use a gossip protocol ("memberlist") to keep track of each other's availability. This is used, for example, to identify which of the ingesters are reachable.
To track reachability, each component sends "heartbeat" messages to the other components, which then update a timestamp recording when the last heartbeat was received. Once this timestamp is older than the "heartbeat timeout", the component is marked as "UNHEALTHY" and is no longer considered active.
Components should continue to send heartbeat messages to unhealthy nodes, so that the system can recover automatically from a disruption.
This works for some time after communication has been disrupted, but we have observed that ingesters stop sending heartbeat messages to other ingesters marked "UNHEALTHY" roughly 5-6 minutes after the node has been marked as unhealthy.
Because this happens on both sides, the memberlist cannot recover automatically once a connectivity issue has persisted for too long.
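The timeout logic described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Loki's or memberlist's actual implementation: the `Ring` class, its method names, and the peer name `ingester-1` are hypothetical, and the 60-second timeout is taken from the reproduction steps in this report.

```python
from dataclasses import dataclass, field

# Assumption: 60 s, matching the 1-minute heartbeat timeout in this report.
HEARTBEAT_TIMEOUT_S = 60


@dataclass
class Ring:
    """Toy view of what one component knows about its peers (hypothetical)."""

    # Maps peer name -> time (in seconds) at which the last heartbeat arrived.
    last_heartbeat: dict = field(default_factory=dict)

    def record_heartbeat(self, peer: str, now: float) -> None:
        # A received heartbeat refreshes the peer's timestamp.
        self.last_heartbeat[peer] = now

    def state(self, peer: str, now: float) -> str:
        # A peer is unhealthy once its last heartbeat is older than the timeout.
        age = now - self.last_heartbeat[peer]
        return "UNHEALTHY" if age > HEARTBEAT_TIMEOUT_S else "ACTIVE"


# Expected behaviour: heartbeats keep being sent, so the peer recovers.
ring = Ring()
ring.record_heartbeat("ingester-1", now=0)
state_during_outage = ring.state("ingester-1", now=120)
ring.record_heartbeat("ingester-1", now=480)  # first heartbeat after the outage
state_after_recovery = ring.state("ingester-1", now=485)

# Observed behaviour: the sender gave up, so no heartbeat ever refreshes the entry.
stale_ring = Ring(last_heartbeat={"ingester-1": 0})
state_if_never_resumed = stale_ring.state("ingester-1", now=485)
```

In this model a single heartbeat after the outage is enough to return the peer to an active state, which is why recovery depends on the sender continuing to send heartbeats to peers it considers unhealthy.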
Version-Release number of selected component (if applicable):
Loki Operator 6.2.0
How reproducible:
Steps to Reproduce:
- Create a LokiStack with three ingesters
- Disrupt the network communication for one of the ingesters
- (depending on how the network was disrupted, the ingester might now show error messages indicating network problems when sending heartbeats)
- Wait until the heartbeat-timeout elapses (1 minute)
- See the ingester set to UNHEALTHY
- Wait another 5-6 minutes
- (error messages about heartbeat messages should now stop)
- Remove the network disruption
- Ingester will remain in UNHEALTHY state even though network communication is fine
Actual results:
The memberlist does not return to a normal state even though the network disruption has been removed.
Expected results:
memberlist should be able to automatically recover from a network communication issue, even if it persists for more than a few minutes.
Additional info:
Example LokiStack:
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: lokistack-dev
  namespace: openshift-logging
spec:
  limits:
    global:
      retention:
        days: 30
  size: 1x.demo
  storage:
    schemas:
    - version: v13
      effectiveDate: 2024-06-01
    secret:
      name: test
      type: s3
  template:
    ingester:
      replicas: 3
  storageClassName: gp3-csi
  tenants:
    mode: openshift-logging
Example command to disrupt network communication:
# Get a shell on a node where an ingester is running
oc debug node/<name>
chroot /host
# Identify ingester PIDs
ps axfu | grep -- -target=ingester
# Select one of the PIDs and inject an iptables rule
nsenter -t <pid> -n iptables -I OUTPUT -p tcp -m tcp --dport 7946 -j DROP
# Remove the rule to allow communication again
nsenter -t <pid> -n iptables -D OUTPUT -p tcp -m tcp --dport 7946 -j DROP
Relates to: LOG-6987 Change Loki configuration to update memberlist ring when ingester becomes unhealthy (Closed)