- Bug
- Resolution: Unresolved
- Normal
- None
- Logging 6.2.3
- Incidents & Support
- False
- False
- NEW
- NEW
- Bug Fix
- Important
Description of problem:
In Loki 6.2.3, which adds the "autoforget_unhealthy" feature, it is observed that the second Loki Ingester is 1/1 but is not able to "autoforget" the unhealthy ingester, throwing the following errors:
$ oc logs logging-loki-ingester-1|grep "found an existing instance"|tail -1
2025-07-10T11:08:04.944659176+05:30 level=warn ts=2025-07-10T05:38:04.847930321Z caller=lifecycler.go:295 component=ingester msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 10.131.2.107:9095 past heartbeat timeout"
$ oc logs logging-loki-ingester-1|grep "autoforget have seen"|tail -1
2025-07-10T14:36:21.933243008+05:30 level=warn ts=2025-07-10T09:06:21.871959622Z caller=ingester.go:491 component=ingester msg="autoforget have seen 1 unhealthy ingesters out of 2, network may be partioned, skip forgeting ingesters this round"
As there are only 2 Loki Ingesters and the replication factor defaults to 2, logs are not stored in Loki:
$ oc get cm logging-loki-config -o jsonpath='{.data.config\.yaml}'|grep -i replication_factor
replication_factor: 2
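For context, the write quorum enforced by the ring is, as an assumption based on the usual dskit quorum rule, floor(replication_factor / 2) + 1 successful ingester writes. With replication_factor: 2 and one of the two ingesters unhealthy, that quorum of 2 can never be met, which matches the "at least 2 live replicas required, could only find 1" distributor error below. A minimal sketch of the arithmetic:

```shell
# Sketch of the assumed quorum rule: quorum = floor(replication_factor / 2) + 1.
# With rf=2, two live ingesters are required, so a single unhealthy
# ingester is enough to fail every write.
rf=2                      # replication_factor from logging-loki-config
quorum=$(( rf / 2 + 1 ))
echo "live ingester replicas required: ${quorum}"
```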
The Loki distributor log indicates that there are not enough live Loki Ingester replicas:
$ oc logs logging-loki-distributor-744f448548-p9fng |grep "at least 2 live replicas required"|tail -1
2025-07-10T14:36:20.132251245+05:30 level=warn ts=2025-07-10T09:06:20.062295748Z caller=logging.go:128 orgID=infrastructure msg="POST /loki/api/v1/push (500) 1.789297ms Response: \"at least 2 live replicas required, could only find 1 - unhealthy instances: x.x.x.x:9095\\n\" ws: false; Accept-Encoding: identity; Content-Encoding: snappy; Content-Length: 91598; Content-Type: application/x-protobuf; User-Agent: Vector/0.37.1 (x86_64-unknown-linux-gnu); X-Forwarded-For: x.x.x.x; X-Forwarded-Prefix: /api/logs/v1/infrastructure; X-Scope-Orgid: infrastructure; "
Version-Release number of selected component (if applicable):
Loki 6.2.3
Loki size: 1x.small
How reproducible:
Not able to reproduce until now.
Steps to Reproduce:
Actual results:
The second Loki Ingester is not able to join the ring, failing with the following errors:
$ oc logs logging-loki-ingester-1|grep "found an existing instance"|tail -1
2025-07-10T11:08:04.944659176+05:30 level=warn ts=2025-07-10T05:38:04.847930321Z caller=lifecycler.go:295 component=ingester msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 10.131.2.107:9095 past heartbeat timeout"
$ oc logs logging-loki-ingester-1|grep "autoforget have seen"|tail -1
2025-07-10T14:36:21.933243008+05:30 level=warn ts=2025-07-10T09:06:21.871959622Z caller=ingester.go:491 component=ingester msg="autoforget have seen 1 unhealthy ingesters out of 2, network may be partioned, skip forgeting ingesters this round"
Expected results:
The unhealthy ingester is autoforgotten and the second Loki Ingester is able to join the ring.
Additional info:
*Data needed to provide*
- Enable log level debug in Loki. Follow the article https://access.redhat.com/solutions/7049665
- Get the Loki Ring status. Follow step 1 of the Workaround in the Resolution section of the article - https://access.redhat.com/solutions/7122281
- Get a partial Prometheus dump providing the chunks for the time window in which the previous steps are done. Follow the article access.redhat.com/solutions/5482971
- Provide a logging must-gather captured while the issue is occurring. Follow the documentation - https://docs.redhat.com/en/documentation/openshift_container_platform/4.16/html/logging/support#cluster-logging-must-gather-collecting_cluster-logging-support
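As a convenience for the ring-status step above, the /ring endpoint mentioned in the lifecycler warning can be inspected from a workstation. This is a hypothetical sketch: the namespace, service name, and HTTP port are assumptions and must be adjusted to the actual LokiStack deployment.

```shell
# Hypothetical sketch: inspect the ingester ring via the distributor's /ring
# endpoint (referenced in the lifecycler warning). The namespace, service
# name, and port below are assumptions; adjust them to your deployment.
oc -n openshift-logging port-forward svc/logging-loki-distributor-http 3100:3100 &
sleep 2
# The ring page lists each ingester instance with its state and last heartbeat,
# which shows whether the unhealthy instance is still registered.
curl -s http://localhost:3100/ring
kill %1
```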